Abstract

We construct a classifier which attains the rate of convergence $\log n/n$ under
sparsity and margin assumptions. An approach close to the one met in approximation
theory for the estimation of functions is used to obtain this result. The idea is to
develop the Bayes rule in a fundamental system of $L^2([0,1]^d)$ made of indicators of
dyadic sets and to assume that the coefficients, equal to $-1$, $0$ or $1$, belong to a
kind of $L^1$-ball. This assumption can be seen as a sparsity assumption, in the sense
that the proportion of coefficients not equal to zero decreases as the "frequency"
grows. Finally, rates of convergence are obtained by using a usual trade-off between
a bias term and a variance term.

1 Introduction 

Consider a measurable space $(\mathcal{X}, \mathcal{A})$ and $\pi$ a probability measure on
$\mathcal{X} \times \{-1,1\}$. Denote by $D_n = (X_i, Y_i)_{1 \le i \le n}$ $n$ observations of $(X, Y)$, a random
variable with values in $\mathcal{X} \times \{-1,1\}$ distributed according to $\pi$. We want to construct
measurable functions which associate a label $y \in \{-1,1\}$ to each point $x$ of $\mathcal{X}$; such
functions are called prediction rules. The quality of a prediction rule $f$ is given by the value

$$R(f) = \mathbb{P}(f(X) \neq Y),$$

called the misclassification error of $f$. It is well known (e.g. Devroye et al. [1996]) that
there exists an optimal prediction rule which attains the minimum of $R$ over all
measurable functions with values in $\{-1,1\}$. It is called the Bayes rule and is defined by

$$f^*(x) = \mathrm{sign}(2\eta(x) - 1),$$

where $\eta$ is the conditional probability function of $Y = 1$ knowing $X$, defined by

$$\eta(x) = \mathbb{P}(Y = 1 \mid X = x).$$

The value

$$R^* = R(f^*) = \min_f R(f)$$

is known as the Bayes risk. The aim of classification is to construct a prediction
rule, using the observations $D_n$, which has a risk as close to $R^*$ as possible. Such a
construction is called a classifier. The performance of a classifier $\hat f_n$ is measured by the
value

$$\mathcal{E}_\pi(\hat f_n) = \mathbb{E}_\pi[R(\hat f_n) - R^*],$$

called the excess risk of $\hat f_n$. In this case $R(\hat f_n) = \mathbb{P}(\hat f_n(X) \neq Y \mid D_n)$ and $\mathbb{E}_\pi$ denotes
the expectation w.r.t. $D_n$ when the probability distribution of $(X_i, Y_i)$ is $\pi$ for any
$i = 1, \dots, n$. We say that a classifier $\hat f_n$ learns with the convergence rate $\phi(n)$, where
$(\phi(n))_{n \in \mathbb{N}}$ is a decreasing sequence, if an absolute constant $C > 0$ exists such that
for any integer $n$, $\mathbb{E}_\pi[R(\hat f_n) - R^*] \le C\phi(n)$.

We introduce a loss function on the set of all prediction rules:

$$d_\pi(f, g) = |R(f) - R(g)|.$$

This loss is a semi-distance (it is symmetric, satisfies the triangle inequality and
$d_\pi(f, f) = 0$). For all classifiers $\hat f_n$, it is linked to the excess risk by

$$\mathcal{E}_\pi(\hat f_n) = \mathbb{E}_\pi[d_\pi(\hat f_n, f^*)],$$

where the right-hand side is the risk of $\hat f_n$ associated to the loss $d_\pi$. In classification we can
consider three estimation problems. The first one is estimation of the Bayes rule $f^*$,
the second one is estimation of the conditional probability function $\eta$ and the last
one is estimation of the probability $\pi$. Usually, estimation of $\eta$ involves a smoothness
assumption on the conditional function $\eta$. However, global smoothness assumptions
on $\eta$ are somehow too restrictive for the estimation of $f^*$, since the behavior of $\eta$
away from the decision boundary $\{x \in \mathcal{X} : \eta(x) = 1/2\}$ may have no effect on the
estimation of $f^*$.

In this paper we deal directly with estimation of $f^*$. But, in this case, the main
difficulty of the classification problem is the dependence on $\pi$ of the loss $d_\pi$ (usually,
we use a loss free from $\pi$, which upper bounds $d_\pi$, to obtain rates of convergence).
Moreover, using the loss $d_\pi$, we do not have the usual bias/variance trade-off, unlike
many other estimation problems. This is due to the fact that we do not have an
approximation theory in classification for the loss $d_\pi$. This gap is due to the
difficulty that $d_\pi$ depends on $\pi$; thus, this theory has to be uniform in $\pi$. We need
approximation results of the form:

$$\forall \pi = (P^X, \eta) \in \mathcal{P}, \ \forall \epsilon > 0, \ \exists f_\epsilon \in \mathcal{F}_\epsilon : d_\pi(f_\epsilon, f^*) \le \epsilon, \qquad (1)$$

where $P^X$ is the marginal distribution of $\pi$ on $\mathcal{X}$, $f^* = \mathrm{sign}(2\eta - 1)$, $\mathcal{P}$ is a set of
probability measures on $\mathcal{X} \times \{-1,1\}$ and the family of classes of prediction rules $(\mathcal{F}_\epsilon)_{\epsilon > 0}$
is decreasing ($\mathcal{F}_\epsilon \subset \mathcal{F}_{\epsilon'}$ if $\epsilon' < \epsilon$) and $\mathcal{F}_\epsilon$ is less complex than $\{f^* : \pi \in \mathcal{P}\}$; in fact we
expect $\mathcal{F}_\epsilon$ to be parametric. Similar results appear in the density estimation literature,
where, for instance, $\mathcal{P}$ is replaced by the set of all probability measures with a density
with respect to the Lebesgue measure lying in an $L^1$-ball and $\mathcal{F}_\epsilon$ is replaced by the
set of all functions with a finite number (depending on $\epsilon$) of coefficients not equal to
zero in the decomposition in the chosen orthogonal basis. But approximation theory
in density estimation does not depend on the underlying probability measure, since
the loss functions used there are generally independent of the underlying statistical
problem. In this paper, we deal directly with the estimation of the Bayes rule and
obtain convergence results w.r.t. the loss $d_\pi$ by using an approximation approach of
the Bayes rules w.r.t. $d_\pi$. Theorems in Section 7 of Devroye et al. [1996] show that
no classifier can learn with a given convergence rate for an arbitrary underlying
probability distribution $\pi$. Thus, assumptions on $f^*$ have to be made to obtain convergence
rates. In this paper, the assumption on $f^*$ is close to the one met in density estimation
when we assume that the underlying density belongs to an $L^1$-ball.

Usually, a model (set of measurable functions with values in $\{-1,1\}$) is considered
and we assume that the Bayes rule belongs to this model. In this case the bias is
equal to zero and no bound on the approximation term is considered. In Blanchard
et al. [2003], the question of the control of the approximation error for a class of models
in the boosting framework is raised. In that paper, it is assumed that the Bayes rule
belongs to the model, and the nature of the distributions satisfying such a condition is explored.
Another related work is Lugosi and Vayatis [2004], where, under general conditions,
it can be guaranteed that the approximation error converges to zero for some specific
models. In the present paper, the bias term is not taken equal to zero and convergence
rates for the approximation error are obtained depending on the complexity of the
considered model (cf. Theorem 2).

We consider the classification problem on $\mathcal{X} = [0,1]^d$. All the results can be
generalized to a given compact subset of $\mathbb{R}^d$. Like in many other works on the classification
problem, an upper bound for the loss $d_\pi$ is used. But in our case we still work directly
with the estimation of $f^*$. For a prediction rule $f$ we have

$$d_\pi(f, f^*) = \mathbb{E}\left[|2\eta(X) - 1| \, \mathbb{1}_{f(X) \neq f^*(X)}\right] \le \|f - f^*\|_{L^1(P^X)}. \qquad (2)$$

In order to get a distribution-free loss function, we assume that the following
assumption holds:

(A1) The marginal $P^X$ is absolutely continuous w.r.t. the Lebesgue measure $\lambda_d$ and
$0 < a \le dP^X(x)/d\lambda_d \le A < +\infty$, $\forall x \in [0,1]^d$.

This is a technical assumption used for the control of the measure of some
subsets of $[0,1]^d$. In recent years some assumptions have been introduced to measure the
statistical quality of classification problems. The behavior of the regression function
$\eta$ near the level $1/2$ is a key point of the classification's quality (cf. e.g. Tsybakov
[2004]). In fact, the closer $\eta$ is to $1/2$, the more difficult the classification problem is;
nevertheless, when $\eta \equiv 1/2$ the classification is trivial, since all prediction
rules are Bayes rules. Here, we measure the quality of the classification problem
thanks to the following assumption introduced by Massart and Nédélec [2003]:

Strong Margin Assumption (SMA): There exists an absolute constant $0 < h \le 1$
such that:

$$\mathbb{P}(|2\eta(X) - 1| \ge h) = 1.$$

Under assumptions (A1) and (SMA) we have

$$a h \|f - f^*\|_{L^1(\lambda_d)} \le d_\pi(f, f^*) \le A \|f - f^*\|_{L^1(\lambda_d)}.$$

Thus, estimation of $f^*$ w.r.t. the loss $d_\pi$ is the same as estimation w.r.t. the $L^1(\lambda_d)$-norm,
where $\lambda_d$ is the Lebesgue measure on $[0,1]^d$.
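
As a sanity check of the identity in (2), the following minimal Python sketch compares $R(f) - R^*$ with $\mathbb{E}[|2\eta(X) - 1| \mathbb{1}_{f(X) \neq f^*(X)}]$ on a toy finite space. The four-point space, the weights `p_x` and the values `eta` are illustrative assumptions and do not come from the paper; ties $\eta = 1/2$ are sent to $+1$ by convention.

```python
import itertools

def risk(f, p_x, eta):
    """Misclassification error R(f) = P(f(X) != Y) on a finite space.

    f[i] is the label predicted at point i, p_x[i] = P(X = i),
    eta[i] = P(Y = 1 | X = i)."""
    # predicting +1 is wrong with probability 1 - eta, predicting -1 with probability eta
    return sum(p * ((1.0 - e) if fx == 1 else e) for fx, p, e in zip(f, p_x, eta))

def bayes_rule(eta):
    """f*(x) = sign(2 eta(x) - 1), with ties sent to +1 (a convention chosen here)."""
    return tuple(1 if e >= 0.5 else -1 for e in eta)

def d_pi(f, p_x, eta):
    """E[|2 eta(X) - 1| 1{f(X) != f*(X)}], the expression appearing in (2)."""
    f_star = bayes_rule(eta)
    return sum(p * abs(2.0 * e - 1.0)
               for fx, fs, p, e in zip(f, f_star, p_x, eta) if fx != fs)

# hypothetical toy distribution on a four-point space
p_x = [0.1, 0.2, 0.3, 0.4]
eta = [0.9, 0.2, 0.6, 0.35]

f_star = bayes_rule(eta)
for f in itertools.product((-1, 1), repeat=len(eta)):
    excess = risk(f, p_x, eta) - risk(f_star, p_x, eta)
    assert abs(excess - d_pi(f, p_x, eta)) < 1e-12
```

The loop enumerates all sixteen prediction rules on the toy space, so the check is exhaustive for this example only.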

The paper is organized as follows. In the next section we propose a representation
for functions with values in $\{-1,1\}$ in a fundamental system of $L^2([0,1]^d)$. The third
section is devoted to approximation and estimation of Bayes rules having a sparse
representation in this system. In the fourth section we discuss this approach.
Proofs are given in the last section.



2 Classes of Bayes Rules with Sparse Representation

Theorem 2 of Subsection 3.1 is about the approximation of the Bayes rules when we
assume that $f^*$ belongs to a kind of "$L^1$-ball" for functions with values in $\{-1,1\}$.
The idea is to develop $f^*$ in a fundamental system of $L^2([0,1]^d, P^X)$ (that is, a countable
family of functions such that the set of all finite linear combinations is dense in
$L^2([0,1]^d, P^X)$) inherited from the Haar basis and to control the number of coefficients
not equal to zero. In this paper we only consider the case where $P^X$ satisfies
(A1). We can extend the study to a more general case by taking another partition
of $[0,1]^d$ adapted to $P^X$.

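First we construct such a fundamental system. We consider a sequence of partitions
of $\mathcal{X} = [0,1]^d$ by setting, for any integer $j$,

$$\mathcal{I}_{\mathbf{k}}^{(j)} = E_{k_1}^{(j)} \times \cdots \times E_{k_d}^{(j)},$$

where $\mathbf{k}$ is the multi-index

$$\mathbf{k} = (k_1, \dots, k_d) \in I_d(j) = \{0, 1, \dots, 2^j - 1\}^d,$$

and for any integer $j$ and any $k \in \{0, \dots, 2^j - 1\}$,

$$E_k^{(j)} = \begin{cases} \left[\frac{k}{2^j}, \frac{k+1}{2^j}\right) & \text{if } k = 0, \dots, 2^j - 2, \\ \left[\frac{2^j - 1}{2^j}, 1\right] & \text{if } k = 2^j - 1. \end{cases}$$

We consider the family $S = \left(\phi_{\mathbf{k}}^{(j)} : j \in \mathbb{N}, \mathbf{k} \in I_d(j)\right)$ where

$$\phi_{\mathbf{k}}^{(j)} = \mathbb{1}_{\mathcal{I}_{\mathbf{k}}^{(j)}}, \quad \forall j \in \mathbb{N}, \ \mathbf{k} \in I_d(j),$$

where $\mathbb{1}_A$ denotes the indicator of a set $A$. The set $S$ is a fundamental system of
$L^2([0,1]^d, P^X)$. This is the class of indicators of the dyadic sets of $[0,1]^d$.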
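
The dyadic partition can be made concrete with a small Python sketch that returns the multi-index of the level-$j$ cell containing a point and evaluates the corresponding indicator. The function names `dyadic_index` and `phi` are illustrative choices, not notation from the paper; the last cell in each coordinate is treated as closed on the right so that the point $1$ is covered.

```python
def dyadic_index(x, j):
    """Multi-index k in {0, ..., 2^j - 1}^d of the level-j dyadic cell containing x.

    x is a sequence of d coordinates in [0, 1]. Cells are half-open
    [k / 2^j, (k + 1) / 2^j), except the last one, which also contains 1."""
    n_cells = 2 ** j
    return tuple(min(int(xi * n_cells), n_cells - 1) for xi in x)

def phi(j, k, x):
    """Indicator of the level-j dyadic cell with multi-index k, evaluated at x."""
    return 1 if dyadic_index(x, j) == tuple(k) else 0

# at each level the cells form a partition: exactly one indicator equals 1 at x
x = (0.3, 1.0)
assert dyadic_index(x, 2) == (1, 3)
assert sum(phi(2, (k1, k2), x) for k1 in range(4) for k2 in range(4)) == 1
```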

We consider the class of functions $f$ defined $P^X$-a.s. from $[0,1]^d$ to $\{-1,1\}$
which can be written in this system by

$$f = \sum_{j=0}^{+\infty} \sum_{\mathbf{k} \in I_d(j)} a_{\mathbf{k}}^{(j)} \phi_{\mathbf{k}}^{(j)}, \quad P^X\text{-a.s.}, \quad \text{where } a_{\mathbf{k}}^{(j)} \in \{-1, 0, 1\},$$

where, for any point $x \in [0,1]^d$, the right-hand side applied at $x$ is a finite sum.
Denote this class by $\mathcal{F}^{(d)}$. In what follows, we use the vocabulary appearing in
the wavelet literature. The index "$j$" of $a_{\mathbf{k}}^{(j)}$ and $\phi_{\mathbf{k}}^{(j)}$ is called the "level of frequency".
Since $S$ is not an orthogonal basis of $L^2([0,1]^d, P^X)$, the expansion of $f$ w.r.t. this
system is not unique. Therefore, to avoid any ambiguity, we define a unique
writing for any mapping $f$ in $\mathcal{F}^{(d)}$ by taking $a_{\mathbf{k}}^{(j)} \in \{-1, 0, 1\}$ with preference for
low frequencies when it is possible. Roughly speaking, for $f \in \mathcal{F}^{(d)}$, denoted by
$f = \sum_{j=0}^{+\infty} \sum_{\mathbf{k} \in I_d(j)} a_{\mathbf{k}}^{(j)} \phi_{\mathbf{k}}^{(j)}$, $P^X$-a.s., where $a_{\mathbf{k}}^{(j)} \in \{-1, 0, 1\}$, it means that we
construct $A_{\mathbf{k}}^{(j)} \in \{-1, 0, 1\}$, $j \in \mathbb{N}$, $\mathbf{k} \in I_d(j)$, such that, if there exist $J \in \mathbb{N}$ and
$\mathbf{k} \in I_d(J)$ such that for all $\mathbf{k}' \in I_d(J+1)$ satisfying $\phi_{\mathbf{k}}^{(J)} \phi_{\mathbf{k}'}^{(J+1)} \neq 0$ we have $a_{\mathbf{k}'}^{(J+1)} = 1$,
then we take $A_{\mathbf{k}}^{(J)} = 1$ and the $2^d$ other coefficients of higher frequency $A_{\mathbf{k}'}^{(J+1)} = 0$,
instead of having these $2^d$ coefficients equal to $1$, and the same convention holds for
$-1$. Moreover, if we have $A_{\mathbf{k}}^{(J_0)} \neq 0$ then $A_{\mathbf{k}'}^{(J)} = 0$ for all $J > J_0$ and $\mathbf{k}' \in I_d(J)$
satisfying $\phi_{\mathbf{k}}^{(J_0)} \phi_{\mathbf{k}'}^{(J)} \neq 0$. We can describe a mapping $f \in \mathcal{F}^{(d)}$ satisfying this
convention by using a tree. Each knot corresponds to a coefficient $A_{\mathbf{k}}^{(J)}$. The root is
$A_{(0,\dots,0)}^{(0)}$. If a knot, describing the coefficient $A_{\mathbf{k}}^{(J)}$, equals $1$ or $-1$ then it has no
branches; otherwise it has $2^d$ branches, corresponding to the $2^d$ coefficients at the
following frequency, describing the coefficients $A_{\mathbf{k}'}^{(J+1)}$ for $\mathbf{k}'$ satisfying $\phi_{\mathbf{k}}^{(J)} \phi_{\mathbf{k}'}^{(J+1)} \neq 0$.
At the end, all the leaves of the tree equal $1$ or $-1$, and the depth of a leaf is
the frequency of the associated coefficient. The writing convention says that a knot
cannot have all its leaves equal to $1$ (or all equal to $-1$). In this case we write this
mapping by putting a $1$ at the knot (or $-1$). In what follows we say that a function
$f \in \mathcal{F}^{(d)}$ satisfies the writing convention (W) when $f$ is written in $S$ using the writing
convention described in this paragraph. Remark that this writing convention is not
an assumption on the function, since we can write all $f \in \mathcal{F}^{(d)}$ using this convention.
Representation of the Bayes rules using dyadic decision trees has been explored by
Nowak and Scott [2004].
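
The writing convention (W) amounts to merging sibling cells bottom-up whenever all $2^d$ of them carry the same sign. The sketch below illustrates this for a function that is constant on the cells of some finest level $J$; this finite-depth restriction, the dictionary encoding and the name `writing_convention` are assumptions made for the example and are not part of the paper.

```python
import itertools

def writing_convention(values, J, d):
    """Non-zero coefficients of f under (W), as a dict {(j, k): +1 or -1}.

    values maps every multi-index k in {0, ..., 2^J - 1}^d to the value
    (+1 or -1) taken by f on the level-J dyadic cell with index k."""
    coeffs = {}

    def visit(j, k):
        # returns the common value of f on cell (j, k), or 0 if f is not constant there
        if j == J:
            return values[k]
        children = [tuple(2 * ki + b for ki, b in zip(k, bits))
                    for bits in itertools.product((0, 1), repeat=d)]
        child_values = [visit(j + 1, c) for c in children]
        if child_values[0] != 0 and all(v == child_values[0] for v in child_values):
            return child_values[0]  # all children agree: push the value to the parent
        for c, v in zip(children, child_values):
            if v != 0:
                coeffs[(j + 1, c)] = v  # constant child of a mixed cell becomes a leaf
        return 0

    root = (0,) * d
    root_value = visit(0, root)
    if root_value != 0:
        coeffs[(0, root)] = root_value
    return coeffs

# d = 1, J = 2: f = +1 on [0, 3/4) and -1 on [3/4, 1]
f_values = {(0,): 1, (1,): 1, (2,): 1, (3,): -1}
print(writing_convention(f_values, J=2, d=1))
```

In this example the two cells of $[0, 1/2)$ are merged into a single coefficient at level 1, while the mixed cell $[1/2, 1]$ keeps its two level-2 coefficients.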

Is it possible to write every measurable function from $[0,1]^d$ to $\{-1,1\}$ in the
fundamental system $S$ using coefficients with values in $\{-1, 0, 1\}$? Since the family
of sets $\left(\mathring{\mathcal{I}}_{\mathbf{k}}^{(j)} : j \in \mathbb{N}, \mathbf{k} \in I_d(j)\right)$, where $\mathring{A}$ denotes the interior of $A$, is a basis of open
subsets of $[0,1]^d$, this question is equivalent to this one: "Take $A$ a Borel subset of $[0,1]^d$;
is it possible to find an open subset $O$ of $[0,1]^d$ such that the symmetric difference
between $A$ and $O$ has Lebesgue measure $0$?" Unfortunately, the answer to this last
question is negative. There exists a Borel set $F \subset [0,1]^d$, closed, with an empty interior
and a positive Lebesgue measure $\lambda_d(F) > 0$. For example, in the one-dimensional
case, the following algorithm yields such a set. Take $(l_k)_{k \ge 1}$ a sequence of numbers
defined by $l_k = 1/2 - 1/(k+1)^2$ for any integer $k$. Denote by $F_0$ the interval $[0,1]$
and construct a sequence of closed sets $(F_k)_{k \ge 0}$ like in the following picture.



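[Figure: first steps $F_1$, $F_2$, $F_3$ of the construction. At step $k$, each interval of $F_{k-1}$ keeps two end pieces of relative length $l_k$ and loses a middle piece of relative length $1 - 2l_k$.]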

It is easy to check that $F = \cap_{k \ge 0} F_k$ is closed, with an empty interior and a
positive Lebesgue measure. For the $d$-dimensional case, the set $F \times [0,1]^{d-1}$ satisfies
the required assumptions. Thus, take $F$ such a set and $O$ an open subset of $[0,1]^d$.
If $O \subset F$ then $O = \emptyset$ because $\mathring{F} = \emptyset$, and $\lambda_d(F \Delta O) = \lambda_d(F) > 0$. If $O \not\subset F$ then
$O \cap F^c$ is a non-empty open subset of $[0,1]^d$, so $\lambda_d(O \Delta F) \ge \lambda_d(O \cap F^c) > 0$. Thus,
not every measurable function from $[0,1]^d$ to $\{-1,1\}$ can be written in $S$ using only
coefficients with values in $\{-1, 0, 1\}$. Nevertheless, the Lebesgue measure satisfies
the property of regularity, which says that for any Borel set $B \subset [0,1]^d$ and any $\epsilon > 0$,
there exist a compact subset $K$ and an open subset $O$ such that $K \subset B \subset O$ and
$\lambda_d(O - K) < \epsilon$. Hence, one can easily check that for any measurable function $f$
from $[0,1]^d$ to $\{-1,1\}$ and any $\epsilon > 0$, there exists a function $g \in \mathcal{F}^{(d)}$ such that
$\lambda_d(\{x \in [0,1]^d : f(x) \neq g(x)\}) \le \epsilon$. Thus, $\mathcal{F}^{(d)}$ is dense in $L^2(\lambda_d)$ intersected with
the set of all measurable functions from $[0,1]^d$ to $\{-1,1\}$. Now, we exhibit some
usual prediction rules which belong to $\mathcal{F}^{(d)}$.
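
The positivity of the measure of the one-dimensional set $F$ constructed above can be illustrated numerically. The sketch below assumes the reading of the construction in which step $k$ replaces every interval of $F_{k-1}$ by its two end pieces of relative length $l_k = 1/2 - 1/(k+1)^2$, so that $\lambda_1(F_k) = \prod_{i=1}^{k} 2 l_i$; the function name `measure_F` is an illustrative choice.

```python
def measure_F(k):
    """Lebesgue measure of F_k, assuming each step i keeps two end pieces
    of relative length l_i = 1/2 - 1/(i + 1)^2 in every remaining interval."""
    measure = 1.0  # F_0 = [0, 1]
    for i in range(1, k + 1):
        l_i = 0.5 - 1.0 / (i + 1) ** 2
        measure *= 2.0 * l_i  # two pieces of relative length l_i survive
    return measure

for k in (1, 2, 10, 100, 1000):
    print(k, measure_F(k))
```

The sequence is decreasing but stays bounded away from zero, because $\sum_k 1/(k+1)^2 < \infty$; this is what makes $F$ a closed set with empty interior and positive measure.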

Definition 1. Let $A$ be a Borel subset of $[0,1]^d$. We say that $A$ is almost everywhere
open if there exists an open subset $O$ of $[0,1]^d$ such that $\lambda_d(A \Delta O) = 0$,
where $\lambda_d$ is the Lebesgue measure on $[0,1]^d$ and $A \Delta O$ is the symmetric difference.

Theorem 1. Let $\eta$ be a function from $[0,1]^d$ to $[0,1]$. We consider

$$f_\eta(x) = \begin{cases} 1 & \text{if } \eta(x) \ge 1/2, \\ -1 & \text{otherwise.} \end{cases}$$

We assume that $\{\eta \ge 1/2\}$ and $\{\eta < 1/2\}$ are almost everywhere open. Then there
exists $g \in \mathcal{F}^{(d)}$ such that $g = f_\eta$, $\lambda_d$-a.s. For
instance, if $\lambda_d(\partial\{\eta = 1/2\}) = 0$ and either $\eta$ is $\lambda_d$-almost everywhere continuous (it
means that there exists an open subset of $[0,1]^d$ with Lebesgue measure equal to $1$
such that $\eta$ is continuous on this open subset) or $\eta$ is $\lambda_d$-almost everywhere equal
to a continuous function, then $f_\eta \in \mathcal{F}^{(d)}$.

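Now, we define a model for the Bayes rule by taking a subset of $\mathcal{F}^{(d)}$. For all
functions $w$ defined on $\mathbb{N}$ and with values in $\mathbb{R}^+$, we consider $\mathcal{F}_w^{(d)}$, the model for
Bayes rules, made of all prediction rules $f$ which can be written, using the previous
writing convention (W), by

$$f = \sum_{j=0}^{+\infty} \sum_{\mathbf{k} \in I_d(j)} a_{\mathbf{k}}^{(j)} \phi_{\mathbf{k}}^{(j)},$$

where $a_{\mathbf{k}}^{(j)} \in \{-1, 0, 1\}$ and

$$\mathrm{card}\left\{\mathbf{k} \in I_d(j) : a_{\mathbf{k}}^{(j)} \neq 0\right\} \le w(j), \quad \forall j \in \mathbb{N}.$$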

The class $\mathcal{F}_w^{(d)}$ depends on the choice of the function $w$. If $w$ is too small then the
class $\mathcal{F}_w^{(d)}$ is not very rich; that is the subject of the following Proposition 1. If $w$ is
too large then $\mathcal{F}_w^{(d)}$ would be too complex for a good estimation of $f^* \in \mathcal{F}_w^{(d)}$; that is
why we introduce Definition 2 in what follows.

Proposition 1. Let $w$ be a mapping from $\mathbb{N}$ to $\mathbb{R}^+$ such that $w(0) \ge 1$. The two
following assertions are equivalent:

(i) $\mathcal{F}_w^{(d)} \neq \{\mathbb{1}_{[0,1]^d}\}$.

(ii) $\sum_{j=1}^{+\infty} 2^{-dj} \lfloor w(j) \rfloor \ge 1$.

And if $w$ is too large then the approximation by a parametric model will be
impossible; that is why we pay particular attention to the class of functions introduced
in the following Definition 2.

Definition 2. Let $w$ be a mapping from $\mathbb{N}$ to $\mathbb{R}^+$. If $w$ satisfies

$$\sum_{j=0}^{+\infty} 2^{-dj} \lfloor w(j) \rfloor < +\infty, \qquad (3)$$

then we say that $\mathcal{F}_w^{(d)}$ is an $L^1$-ball of prediction rules.

Remark 1. We say that $\mathcal{F}_w^{(d)}$ is an "$L^1$-ball" for a function $w$ satisfying (3) because
the sequence $(\lfloor w(j) \rfloor)_{j \in \mathbb{N}}$ belongs to an $L^1$-ball of $\mathbb{N}^{\mathbb{N}}$, with radius $(2^{dj})_{j \in \mathbb{N}}$. Moreover,
Definition 2 can be linked to the definition of an $L^1$-ball for real-valued functions, since
we have a kind of basis, given by $S$, and we have a control on coefficients which
increases with the frequency. The control on coefficients given by (3) is close to the one
for coefficients of a real-valued function in an $L^1$-ball, since it deals with the quality of
approximation of the class $\mathcal{F}_w^{(d)}$ by a parametric model.

Remark 2. An $L^1$-ball of prediction rules is made of "sparse" prediction rules. In
fact, for $f \in \mathcal{F}_w^{(d)}$, the distribution of coefficients not equal to zero in the decomposition
of $f$ at a given frequency becomes sparse as the frequency grows. That is the
reason why $\mathcal{F}_w^{(d)}$ can be called a sparse class of prediction rules. For example,
if $(\lfloor w(j) \rfloor / 2^{dj})_{j \ge 1}$ decreases and (3) holds, then the number of coefficients not equal to
$0$ at the frequency $j$ is, up to a multiplicative constant, smaller than a proportion $j^{-1}$ of
the maximal number of coefficients (that is $2^{dj}$).

Remark 3. If we assume that $P^X$ is known then we can work with any measurable
space $\mathcal{X}$ endowed with a measure $\lambda$, while assuming that $P^X \ll \lambda$. In this
case, we take $\left(\mathcal{I}_{\mathbf{k}}^{(j)} : j \in \mathbb{N}, \mathbf{k} \in I_d(j)\right)$ such that for any $j \in \mathbb{N}$, $\left(\mathcal{I}_{\mathbf{k}}^{(j)} : \mathbf{k} \in I_d(j)\right)$ is
a partition of $\mathcal{X}$ adapted to the previous one $\left(\mathcal{I}_{\mathbf{k}}^{(j-1)} : \mathbf{k} \in I_d(j-1)\right)$ and satisfying
$P^X(\mathcal{I}_{\mathbf{k}}^{(j)}) = 2^{-dj}$. All the results below can be obtained in this framework.

Now, examples of functions satisfying (3) are given. Classes associated to
these functions are used in what follows to define statistical models. As an
introduction we define the minimal infinite class of prediction rules $\mathcal{F}_0^{(d)}$, which is the
class $\mathcal{F}_w^{(d)}$ for $w = w_0^{(d)}$, where $w_0^{(d)}(0) = 1$ and $w_0^{(d)}(j) = 2^d - 1$ for all $j \ge 1$. To
understand why this class is important we introduce a notion of local oscillation of a
prediction rule. This concept defines a kind of "regularity" for functions with values
in $\{-1,1\}$.

Definition 3. Let $f$ be a prediction rule from $[0,1]^d$ to $\{-1,1\}$ in $\mathcal{F}^{(d)}$. We consider
the writing of $f$ in the fundamental system introduced in Section 2 with writing
convention (W):

$$f = \sum_{j=0}^{+\infty} \sum_{\mathbf{k} \in I_d(j)} a_{\mathbf{k}}^{(j)} \phi_{\mathbf{k}}^{(j)}.$$

Let $J \in \mathbb{N}$ and $\mathbf{k} \in I_d(J)$. We say that $\mathcal{I}_{\mathbf{k}}^{(J)}$ is a low oscillating block of $f$ when
$f$ has exactly $2^d - 1$ coefficients, in this block, not equal to zero at each level of
frequency greater than $J + 1$. In this case we say that $f$ has a low oscillating
block of frequency $J$.

Remark that, if $f$ has an oscillating block of frequency $J$, then $f$ has an oscillating
block of frequency $J'$, for all $J' \ge J$. The function class $\mathcal{F}_0^{(d)}$ is made of all prediction
rules with one oscillating block at level 1 and of the indicator function $\mathbb{1}_{[0,1]^d}$. If we have
$w(j_0) < w_0^{(d)}(j_0)$ for one $j_0 \ge 1$ and $w(j) = w_0^{(d)}(j)$ for $j \neq j_0$, then the associated
class $\mathcal{F}_w^{(d)}$ contains only the indicator function $\mathbb{1}_{[0,1]^d}$; that is the reason why we say
that $\mathcal{F}_0^{(d)}$ is "minimal".
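
The borderline role of $w_0^{(d)}$ can be checked numerically, reading assertion (ii) of Proposition 1 as $\sum_{j \ge 1} 2^{-dj} \lfloor w(j) \rfloor \ge 1$. The following Python sketch computes partial sums of this series; the helper names `weighted_sum` and `w_minimal` and the truncation level of the series are assumptions made for the illustration.

```python
import math

def weighted_sum(w, d, j_min, j_max):
    """Partial sum of 2^(-d j) * floor(w(j)) for j = j_min, ..., j_max."""
    return sum(math.floor(w(j)) * 2.0 ** (-d * j) for j in range(j_min, j_max + 1))

def w_minimal(d):
    """The function w_0 of the minimal class: w_0(0) = 1, w_0(j) = 2^d - 1 for j >= 1."""
    return lambda j: 1 if j == 0 else 2 ** d - 1

for d in (1, 2, 3):
    # the series starting at j = 1 sums to exactly 1 for w_0
    print(d, weighted_sum(w_minimal(d), d, 1, 60))

# lowering w_0 at a single frequency pushes the sum strictly below 1
d = 2
w_smaller = lambda j: w_minimal(d)(j) - (1 if j == 2 else 0)
print(weighted_sum(w_smaller, d, 1, 60))
```

For $w_0^{(d)}$ the geometric series gives $(2^d - 1) \sum_{j \ge 1} 2^{-dj} = 1$, so any decrease of $w$ at one frequency breaks the inequality.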

Nevertheless, the following proposition shows that $\mathcal{F}_0^{(d)}$ is a rich class of prediction
rules from a combinatorial point of view. We recall some quantities which measure
the combinatorial richness of a class of prediction rules. For any class $\mathcal{F}$ of prediction
rules from $\mathcal{X}$ to $\{-1,1\}$, we consider

$$N(\mathcal{F}, (x_1, \dots, x_m)) = \mathrm{card}\left(\{(f(x_1), \dots, f(x_m)) : f \in \mathcal{F}\}\right),$$

where $x_1, \dots, x_m \in \mathcal{X}$ and $m \in \mathbb{N}$,

$$S(\mathcal{F}, m) = \max\left(N(\mathcal{F}, (x_1, \dots, x_m)) : x_1, \dots, x_m \in \mathcal{X}\right),$$

and the VC-dimension of $\mathcal{F}$ is

$$VC(\mathcal{F}) = \min\left(m \in \mathbb{N} : S(\mathcal{F}, m) \neq 2^m\right).$$

Consider $x_j = \left(\frac{2^j + 1}{2^{j+1}}, \frac{1}{2^{j+1}}, \dots, \frac{1}{2^{j+1}}\right)$, for any $j \in \mathbb{N}$. Thus, for any integer $m$, we
have $N(\mathcal{F}_0^{(d)}, (x_1, \dots, x_m)) = 2^m$. Hence, the following proposition holds.

Proposition 2. The class of prediction rules $\mathcal{F}_0^{(d)}$ has an infinite VC-dimension.

Thus every class $\mathcal{F}_w^{(d)}$ such that $w \ge w_0^{(d)}$ has an infinite VC-dimension (since
$w \le w' \Rightarrow \mathcal{F}_w^{(d)} \subseteq \mathcal{F}_{w'}^{(d)}$), which is the case for the following classes.

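Now, we introduce some examples of $L^1$-balls of Bayes rules. We denote by $\mathcal{F}_K^{(d)}$,
for $K \in \mathbb{N}^*$, the class $\mathcal{F}_w^{(d)}$ of prediction rules where $w$ is equal to the function

$$w_K^{(d)}(j) = \begin{cases} 2^{dj} & \text{if } j \le K, \\ 2^{dK} & \text{otherwise.} \end{cases}$$

This class is called the truncated class of level $K$.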

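We consider exponential classes. These sets of prediction rules are denoted by
$\mathcal{F}_\alpha^{(d)}$, where $0 < \alpha < 1$, and are equal to $\mathcal{F}_w^{(d)}$ when $w = w_\alpha^{(d)}$ and

$$w_\alpha^{(d)}(j) = \begin{cases} 2^{dj} & \text{if } j \le N^{(d)}(\alpha), \\ 2^{d\alpha j} & \text{otherwise,} \end{cases}$$

where $N^{(d)}(\alpha) = \inf\left(N \in \mathbb{N} : 2^{d\alpha N} \ge 2^d - 1\right)$, that is $N^{(d)}(\alpha) = \lceil \log(2^d - 1)/(d\alpha \log 2) \rceil$.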
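
The two families of functions $w$ can be written down directly. The Python sketch below encodes the truncated and exponential functions as read from the definitions above and checks the closed form of $N^{(d)}(\alpha)$ against the infimum that defines it; the function names are illustrative, and the threshold $j \le N^{(d)}(\alpha)$ follows this reading of the (partly damaged) source formula.

```python
import math

def n_alpha(d, alpha):
    """Closed form ceil(log(2^d - 1) / (d * alpha * log 2)) of N^(d)(alpha)."""
    return math.ceil(math.log(2 ** d - 1) / (d * alpha * math.log(2)))

def n_alpha_brute_force(d, alpha):
    """Smallest N in {0, 1, 2, ...} such that 2^(d * alpha * N) >= 2^d - 1."""
    N = 0
    while 2.0 ** (d * alpha * N) < 2 ** d - 1:
        N += 1
    return N

def w_truncated(d, K):
    """w_K(j) = 2^(d j) if j <= K, and 2^(d K) otherwise."""
    return lambda j: 2 ** (d * j) if j <= K else 2 ** (d * K)

def w_exponential(d, alpha):
    """w_alpha(j) = 2^(d j) if j <= N(alpha), and 2^(d alpha j) otherwise."""
    N = n_alpha(d, alpha)
    return lambda j: 2 ** (d * j) if j <= N else 2.0 ** (d * alpha * j)

for d in (1, 2, 3):
    for alpha in (0.3, 0.5, 0.9):
        assert n_alpha(d, alpha) == n_alpha_brute_force(d, alpha)
```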

Remark 4. For the one-dimensional case, another point of view is to consider
$f^* \in L^2([0,1])$ and to develop $f^*$ in an orthogonal basis of $L^2([0,1])$. Namely,

$$f^* = \sum_{j \in \mathbb{N}} \sum_{k=0}^{2^j - 1} a_k^{(j)} \psi_k^{(j)},$$

where $a_k^{(j)} = \int_0^1 f^*(x) \psi_k^{(j)}(x) \, dx$ for any $j \in \mathbb{N}$ and $k = 0, \dots, 2^j - 1$. For the control of
the bias term we assume that the family of coefficients $(a_k^{(j)}, j \in \mathbb{N}, k = 0, \dots, 2^j - 1)$
belongs to an $L^1$-ball. But this point of view leads to analysis and estimation issues.
First problem: which functions with values in $\{-1,1\}$ have wavelet coefficients in
an $L^1$-ball, and which wavelet basis is more adapted to our problem (maybe the Haar
basis)? Second problem: which kind of estimators could be used for the estimation
of these coefficients? As we can see, the main problem is that there is no approximation
theory for functions with values in $\{-1,1\}$. We do not know how to approach,
in $L^2([0,1])$, measurable functions with values in $\{-1,1\}$ by "parametric" functions
with values in $\{-1,1\}$. Methods developed in this paper may be seen as a first step
in this field. We can generalize this approach to functions with values in $\mathbb{Z}$. Remark
that when functions take values in $\mathbb{R}$, that is for the regression problem, usual
approximation theory is used to obtain a control on the bias term.

Remark 5. Other sets of prediction rules are described by the classes $\mathcal{F}_w^{(d)}$ where $w$
is from $\mathbb{N}$ to $\mathbb{R}^+$ and satisfies

where (aj)j>i is an increasing sequence of positive numbers. 

3 Rates of Convergence over $\mathcal{F}_w^{(d)}$ under (SMA)

3.1 Approximation Result 

Let $w$ be a function from $\mathbb{N}$ to $\mathbb{R}^+$ and $A > 1$. We denote by $\mathcal{P}_{w,A}$ the set of all
probability measures $\pi$ on $[0,1]^d \times \{-1,1\}$ such that the Bayes rule $f^*$ associated
to $\pi$ belongs to $\mathcal{F}_w^{(d)}$ and the marginal of $\pi$ on $[0,1]^d$ is absolutely continuous and
one version of its Lebesgue density is upper bounded by $A$. The following theorem
can be seen as an approximation theorem for the Bayes rules w.r.t. the loss $d_\pi$,
uniformly in $\pi \in \mathcal{P}_{w,A}$.

Theorem 2 (Approximation Theorem). Let $\mathcal{F}_w^{(d)}$ be an $L^1$-ball of prediction
rules. We have:

$$\forall \epsilon > 0, \ \exists J_\epsilon \in \mathbb{N} : \forall \pi \in \mathcal{P}_{w,A}, \ \exists f_\epsilon = \sum_{\mathbf{k} \in I_d(J_\epsilon)} B_{\mathbf{k}}^{(J_\epsilon)} \phi_{\mathbf{k}}^{(J_\epsilon)},$$

where $B_{\mathbf{k}}^{(J_\epsilon)} \in \{-1, 1\}$ and

$$d_\pi(f^*, f_\epsilon) \le \epsilon,$$

where $f^*$ is the Bayes rule associated to $\pi$. For example, $J_\epsilon$ can be the smallest
integer $J$ satisfying $\sum_{j=J+1}^{+\infty} 2^{-dj} \lfloor w(j) \rfloor < \epsilon/A$.
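
To make the choice of $J_\epsilon$ concrete, the following Python sketch searches for the smallest $J$ whose tail sum $\sum_{j > J} 2^{-dj} \lfloor w(j) \rfloor$ falls below $\epsilon/A$. The infinite tail is truncated at a cut-off `j_max`, which is an assumption of the sketch (harmless when the neglected terms decay geometrically, as for the truncated class used in the example); the function names are illustrative.

```python
import math

def tail_sum(w, d, J, j_max=200):
    """Sum of 2^(-d j) * floor(w(j)) for j = J + 1, ..., j_max (truncated tail)."""
    return sum(math.floor(w(j)) * 2.0 ** (-d * j) for j in range(J + 1, j_max + 1))

def smallest_rank(w, d, eps, A, j_max=200):
    """Smallest J with tail_sum(w, d, J) < eps / A, searched below the cut-off j_max."""
    for J in range(j_max):
        if tail_sum(w, d, J, j_max) < eps / A:
            return J
    raise ValueError("no rank found below the cut-off j_max")

# truncated class of level K = 2 in dimension d = 1: w(j) = 2^j if j <= 2, else 4
w = lambda j: 2 ** j if j <= 2 else 4
J_eps = smallest_rank(w, d=1, eps=0.1, A=2.0)
print(J_eps)  # for J >= 2 the tail equals 2^(2 - J), so the search stops at J = 7
```

For this truncated class the tail is geometric, so $J_\epsilon$ grows like $\log(1/\epsilon)/(d \log 2)$, which is the source of the logarithmic frequency rank used later.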

Remark 6. No assumption on the quality of the classification problem, like an
assumption on the margin, is needed to state Theorem 2. Only an assumption on the
"number of oscillations" of $f^*$ is used. Theorem 2 deals with approximation of functions
in the $L^1$-ball $\mathcal{F}_w^{(d)}$ by functions with values in $\{-1,1\}$, and no estimation issues
are met.

Remark 7. Theorem 2 is the first step to prove an estimation theorem using a
trade-off between a bias term and a variance term. We write

$$\mathcal{E}_\pi(\hat f_n) = \mathbb{E}_\pi[d_\pi(\hat f_n, f^*)] \le \mathbb{E}_\pi[d_\pi(\hat f_n, f_\epsilon)] + d_\pi(f_\epsilon, f^*).$$

Since $f_\epsilon$ belongs to a parametric model we expect to have a control of the variance
term, $\mathbb{E}_\pi[d_\pi(\hat f_n, f_\epsilon)]$, depending on the dimension of the parametric model, which is
linked to the quality of the approximation in the bias term.

Remark 8. Since $d_\pi(f^*, f_\epsilon) = \mathbb{E}\left[|2\eta(X) - 1| \, \mathbb{1}_{f^*(X) \neq f_\epsilon(X)}\right]$, the closer $\eta$ is to $1/2$,
the smaller the bias is. In particular, the bias is equal to zero when $\eta \equiv 1/2$ (in
this case any prediction rule is a Bayes rule). Thus, the more difficult the estimation
problem is (that is, for an underlying probability measure $\pi = (P^X, \eta)$ with $\eta$ close
to $1/2$), the smaller the bias is. This behavior does not appear clearly in density
estimation.

3.2 Estimation Result 

We consider the following class of estimators indexed by the frequency rank $J \in \mathbb{N}$:

$$\hat f_n^{(J)} = \sum_{\mathbf{k} \in I_d(J)} \hat A_{\mathbf{k}}^{(J)} \phi_{\mathbf{k}}^{(J)}, \qquad (4)$$

where the coefficients are defined by

$$\hat A_{\mathbf{k}}^{(J)} = \begin{cases} 1 & \text{if } \exists X_i \in \mathcal{I}_{\mathbf{k}}^{(J)} \text{ and } \mathrm{card}\{i : X_i \in \mathcal{I}_{\mathbf{k}}^{(J)}, Y_i = 1\} > \mathrm{card}\{i : X_i \in \mathcal{I}_{\mathbf{k}}^{(J)}, Y_i = -1\}, \\ -1 & \text{otherwise.} \end{cases}$$

To obtain a good control of the variance term, we need to ensure a good quality
of the estimation problem. Therefore, estimation results are obtained in Theorem 3
under the (SMA) assumption. In recent years we have understood that the (SMA)
assumption can lead to fast rates but is not enough to ensure any rate of convergence (cf.
Corollary 1 at the end of Section 3.3); thus we have to define a model for $\eta$ or $f^*$. Here
we use an $L^1$-ball of prediction rules as a model for $f^*$.

Theorem 3 (Estimation Theorem). Let $\mathcal{F}_w^{(d)}$ be an $L^1$-ball of prediction rules.
Let $\pi$ be a probability measure on $[0,1]^d \times \{-1,1\}$ satisfying assumptions (A1) and
(SMA), and such that the Bayes rule associated to $\pi$ belongs to $\mathcal{F}_w^{(d)}$. The excess
risk of the classifier $\hat f_n^{(J_\epsilon)}$ satisfies, for any positive number $\epsilon$,

$$\mathcal{E}_\pi(\hat f_n^{(J_\epsilon)}) = \mathbb{E}_\pi\left[d_\pi(\hat f_n^{(J_\epsilon)}, f^*)\right] \le (1 + A)\epsilon + \exp\left(-n a (1 - \exp(-h^2/2)) 2^{-dJ_\epsilon}\right),$$

where $J_\epsilon$ is the smallest integer satisfying $\sum_{j=J_\epsilon+1}^{+\infty} 2^{-dj} \lfloor w(j) \rfloor < \epsilon/A$. Parameters
$a$, $A$ appear in Assumption (A1) and $h$ is used in (SMA).
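
The estimator (4) is a majority vote inside each dyadic cell of level $J$, with the value $-1$ assigned to empty cells and to ties. A minimal Python sketch follows; the function names `cell_index` and `fit_dyadic_classifier` and the toy sample are illustrative assumptions.

```python
from collections import defaultdict

def cell_index(x, J):
    """Multi-index of the level-J dyadic cell of [0, 1]^d containing x."""
    n_cells = 2 ** J
    return tuple(min(int(xi * n_cells), n_cells - 1) for xi in x)

def fit_dyadic_classifier(X, Y, J):
    """Return the prediction function of the level-J majority-vote classifier.

    X is a list of points of [0, 1]^d (tuples), Y the list of labels in {-1, +1}."""
    votes = defaultdict(int)
    for x, y in zip(X, Y):
        votes[cell_index(x, J)] += y  # sum of labels = (#labels +1) - (#labels -1)

    def predict(x):
        # +1 only under a strict majority of +1 labels; empty cells and ties give -1
        return 1 if votes.get(cell_index(x, J), 0) > 0 else -1

    return predict

# toy one-dimensional sample
X = [(0.1,), (0.2,), (0.6,), (0.7,), (0.8,)]
Y = [1, 1, -1, 1, -1]
predict = fit_dyadic_classifier(X, Y, J=1)
print(predict((0.3,)), predict((0.9,)))
```

Increasing $J$ refines the partition but leaves more cells empty, which is the variance side of the trade-off quantified in Theorem 3.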

Remark 9. The upper bound can be split into the bias term, $\epsilon$, and the variance
term, $A\epsilon + \exp\left(-n a (1 - \exp(-h^2/2)) 2^{-dJ_\epsilon}\right)$. Remark that a bias term appears in
the variance term.
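
The bound of Theorem 3 is easy to evaluate numerically, which helps to see how the two terms move against each other when $\epsilon$ (and hence $J_\epsilon$) varies. The Python sketch below simply codes the right-hand side of the inequality; the function name and the numerical values are illustrative assumptions, not values from the paper.

```python
import math

def excess_risk_bound(eps, n, J_eps, d, a, A, h):
    """Right-hand side (1 + A) eps + exp(-n a (1 - exp(-h^2 / 2)) 2^(-d J_eps))."""
    exponent = n * a * (1.0 - math.exp(-h ** 2 / 2.0)) * 2.0 ** (-d * J_eps)
    return (1.0 + A) * eps + math.exp(-exponent)

# hypothetical setting: d = 1, a = 0.5, A = 2, h = 0.5, eps = 0.1 with J_eps = 7
for n in (10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5):
    print(n, excess_risk_bound(0.1, n, J_eps=7, d=1, a=0.5, A=2.0, h=0.5))
```

For fixed $\epsilon$ the exponential term vanishes as $n$ grows and the bound saturates at $(1 + A)\epsilon$, so $\epsilon$ has to decrease with $n$ to obtain a rate.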

3.3 Optimality 

This section is devoted to the optimality, in a minimax sense, of estimation in
classification models such that $f^* \in \mathcal{F}_w^{(d)}$. Let $0 < h < 1$, $0 < a \le 1 \le A < +\infty$ and $w$
a mapping from $\mathbb{N}$ to $\mathbb{R}^+$. We denote by $\mathcal{P}_{w,h,a,A}$ the set of all probability measures
$\pi = (P^X, \eta)$ on $[0,1]^d \times \{-1,1\}$ such that

1. The marginal $P^X$ satisfies (A1).

2. The Assumption (SMA) is satisfied.

3. The Bayes rule $f^*$ associated to $\pi$ belongs to $\mathcal{F}_w^{(d)}$.

We use the version of the Assouad Lemma in the appendix of Lecué [2006c] to lower
bound the minimax risk on $\mathcal{P}_{w,h,a,A}$. From Theorem 3 and Theorem 4, we can deduce
the optimality (up to a logarithmic term) of the estimator $\hat f_n^{(J_n)}$, where the rank $J_n$ is
obtained by an optimal trade-off between the bias term and the variance term.

Theorem 4. Let $w$ be a function from $\mathbb{N}$ to $\mathbb{R}^+$ such that

(i) $\lfloor w(0) \rfloor \ge 1$ and $\forall j \ge 1$, $\lfloor w(j) \rfloor \ge 2^d - 1$,

(ii) $\forall j \ge 1$, $\lfloor w(j-1) \rfloor \ge 2^{-d} \lfloor w(j) \rfloor$.

We have, for all $n \in \mathbb{N}$,

$$\inf_{\hat f_n} \sup_{\pi \in \mathcal{P}_{w,h,a,A}} \mathcal{E}_\pi(\hat f_n) \ge C_0 n^{-1} \left( \lfloor w\left(\lfloor \log n/(d \log 2) \rfloor + 1\right) \rfloor - (2^d - 1) \right),$$

and if $\lfloor w(j) \rfloor \ge 2^d$, $\forall j \ge 1$, then $\inf_{\hat f_n} \sup_{\pi \in \mathcal{P}_{w,h,a,A}} \mathcal{E}_\pi(\hat f_n) \ge C_0 n^{-1}$, where
$C_0 = (h/8) \exp\left(-(1 - \sqrt{1 - h^2})\right)$.

Remark 10. For a function $w$ satisfying the assumptions of Theorem 4 and under
(SMA), we cannot expect a convergence rate faster than $1/n$, which is the usual
lower bound for the classification problem under (SMA).

From the previous theorem we immediately obtain Theorem 7.1 of Devroye et al.
[1996]. We denote by $\mathcal{P}_1$ the class of all probability measures on $[0,1]^d \times \{-1,1\}$ such
that the marginal distribution $P^X$ is $\lambda_d$ (the Lebesgue probability distribution on
$[0,1]^d$) and (SMA) is satisfied with the margin $h = 1$. The case "$h = 1$" is equivalent
to $R^* = 0$. That is for a perfect classification problem, where $Y$ is an exact function
of $X$ given by $Y = f^*(X) = 2\eta(X) - 1$.

Corollary 1. For any integer $n$ we have

$$\inf_{\hat f_n} \sup_{\pi \in \mathcal{P}_1} \mathcal{E}_\pi(\hat f_n) \ge \frac{1}{8e}.$$

It means that no classifier can achieve a rate of convergence in the classification
model $\mathcal{P}_1$, even if these classification problems are all very good ($Y$ is given by
$f^*(X)$ without any noise and there is no spot of low probability).

3.4 Rates of Convergence for Different Classes of Prediction 
Rules 

In this section we apply results stated in Theorem 3 and Theorem 4 to the different
$L^1$-ball classes $\mathcal{F}_w^{(d)}$ introduced at the end of Section 2. We give rates of convergence
and lower bounds for these models. Using notations introduced in Section 2 and
Subsection 3.3, we consider the following models. For $w = w_K^{(d)}$ denote by $\mathcal{P}_K^{(d)}$ the
set $\mathcal{P}_{w_K^{(d)},h,a,A}$ of probability measures on $[0,1]^d \times \{-1,1\}$, and by $\mathcal{P}_\alpha^{(d)}$ the analogous set for $w = w_\alpha^{(d)}$.

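Theorem 5. For the truncated class $\mathcal{F}_K^{(d)}$, we have

$$\sup_{\pi \in \mathcal{P}_K^{(d)}} \mathcal{E}_\pi(\hat f_n^{(J_n)}) \le C_{K,h,a,A} \frac{\log n}{n},$$

where $C_{K,h,a,A} > 0$ depends only on $K, h, a, A$, and for the lower bound, there
exists $C_{0,K,h,a,A} > 0$ depending only on $K, h, a, A$ such that, for all $n \in \mathbb{N}$,

$$\inf_{\hat f_n} \sup_{\pi \in \mathcal{P}_K^{(d)}} \mathcal{E}_\pi(\hat f_n) \ge C_{0,K,h,a,A} \, n^{-1}.$$

For the exponential class $\mathcal{F}_\alpha^{(d)}$ where $0 < \alpha < 1$, we have for any integer $n$

$$\sup_{\pi \in \mathcal{P}_\alpha^{(d)}} \mathcal{E}_\pi(\hat f_n^{(J_n)}) \le C'_{\alpha,h,a,A} \left(\frac{\log n}{n}\right)^{1-\alpha},$$

where $C'_{\alpha,h,a,A} > 0$, and for the lower bound, there exists $C'_{0,\alpha,h,a,A} > 0$ depending only
on $\alpha, h, a, A$ such that, for all $n \in \mathbb{N}$,

$$\inf_{\hat f_n} \sup_{\pi \in \mathcal{P}_\alpha^{(d)}} \mathcal{E}_\pi(\hat f_n) \ge C'_{0,\alpha,h,a,A} \, n^{-1+\alpha}.$$

In both classes, the order of $J_n$ is $\lceil \log(an/(2^d \log n))/(d \log 2) \rceil$, up to a multiplying
constant.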

A remarkable point is that the class $\mathcal{F}_K^{(d)}$ has an infinite VC-dimension (cf. Section
2). Nevertheless, the rate $\log n/n$ is achieved on this model.
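
To give an idea of the size of the partition behind these rates, the frequency rank of Theorem 5 can be computed for a few sample sizes. In the Python sketch below the outer bracket is read as a ceiling, which is an assumption about the damaged source formula, and the function name `frequency_rank` is illustrative.

```python
import math

def frequency_rank(n, d, a):
    """ceil(log(a n / (2^d log n)) / (d log 2)); the ceiling is an assumed rounding."""
    return math.ceil(math.log(a * n / (2 ** d * math.log(n))) / (d * math.log(2)))

for n in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    J_n = frequency_rank(n, d=2, a=1.0)
    # 2^(d J_n) is the number of dyadic cells, i.e. of coefficients to estimate
    print(n, J_n, 2 ** (2 * J_n))
```

The rank grows only logarithmically with $n$, so the number of estimated coefficients stays polynomial in $n$.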



4 Discussion 

In this section we discuss the representation and estimation of "simple" prediction
rules in our framework. In considering the classification problem over the square
$[0,1]^2$, a classifier has to be able to approach, for instance, the "simple" Bayes rule
$f_C^*$ which is equal to $1$ inside $C$, where $C$ is a disc of $[0,1]^2$, and $-1$ outside $C$. In our
framework, two questions need to be considered:

• What is the representation of the simple function $f_C^*$ in our fundamental system,
using only coefficients with values in $\{-1, 0, 1\}$ and with the writing convention
(W)?

• Is the estimate $\hat f_n^{(J_n)}$, where $J_n = \lceil \log(an/(2^d \log n))/(d \log 2) \rceil$ is the
frequency rank appearing in Theorem 5, a good classifier when the underlying
probability measure has $f_C^*$ for Bayes rule?

At first glance, our point of view is not the right way to estimate $f_C^*$. In this
regular case (the border is an infinitely differentiable curve), the direct estimation of
the border is a better approach. The main reason is that a 2-dimensional estimation
problem becomes a 1-dimensional problem. Such a reduction of dimension makes
estimation easier (in passing, our approach is specifically good in the 1-dimensional
case, since the notion of border does not exist in this case). Nevertheless, our
approach is applicable for the estimation of such functions (cf. Theorem 6). Actually,
direct estimation of the border reduces the dimension, but there is a big waste of
observations since observations far from the border are not used from this estimation
point of view. It may explain why our approach is applicable. Denote by

$$\mathcal{N}(A, \epsilon, \|\cdot\|_\infty) = \min\left\{N : \exists x_1, \dots, x_N \in \mathbb{R}^2 : A \subseteq \cup_{j=1}^{N} B_\infty(x_j, \epsilon)\right\}$$

the $\epsilon$-covering number of a subset $A$ of $[0,1]^2$, w.r.t. the infinity norm of $\mathbb{R}^2$. For
example, the circle $C = \{(x, y) \in \mathbb{R}^2 : (x - 1/2)^2 + (y - 1/2)^2 = (1/4)^2\}$ satisfies
$\mathcal{N}(C, \epsilon, \|\cdot\|_\infty) \le (\pi/4)\epsilon^{-1}$. For any set $A$ of $[0,1]^2$ denote by $\partial A$ the border of $A$.

Theorem 6. Let A be a subset of [0, 1]^ such that J\f{dA, e, 1 1.| |oo) < S{e), for any e > 
0, where 6 is a decreasing function from with values in satisfying e'^6{e) — > 
when e tends to zero. Consider the prediction rule f^ = 21^ — 1. For any e > 0, 
denote by eg the greatest positive number satisfying 5(eo)eQ < e. There exists a 
prediction rule constructed in the fundamental system S at the frequency rank J^^ 
with coefficients in {—1,1} denoted by 

with — [log(l/eo)/log2j such that 

ll/eo - IaWl^x^) < 36e. 
For instance, there exists a function $f_n$, written in the fundamental system $S$ at the frequency level $J_n = \lfloor \log(4n/(\pi \log n))/\log 2 \rfloor$, which approaches the prediction rule $f_C$ with an $L^1(\lambda_2)$ error upper bounded by $36(\log n)/n$. This frequency level is, up to a multiplying constant, the same as the one appearing in Theorem 5. More generally, any prediction rule with a border having a finite perimeter (for instance polygons) is approached by a function written in the fundamental system at the same frequency rank $J_n$, with the same order of $L^1(\lambda_2)$ error $(\log n)/n$. Remark that for this frequency level $J_n$, we have to estimate $n/\log n$ coefficients. The estimation of one of these coefficients $a_k^{(J_n)}$, where $k \in I_2(J_n)$, depends on the number of observations in the square $\mathcal{I}_k^{(J_n)}$ associated with this coefficient. The probability that no observation "falls" in $\mathcal{I}_k^{(J_n)}$ is smaller than $n^{-1}$. Thus, the number of coefficients estimated with no observations is small compared to the order of approximation $(\log n)/n$ and is taken into account in the variance term. Now, the problem is about finding an $L^1$-ball of prediction rules such that for any integer $n$ the approximation function $f_n$ belongs to such a ball. This problem depends on the geometry of the border set $\partial A$. It arises naturally since we chose a particular geometry for our partition: dyadic partitions of the space $[0,1]^d$, and we have to pay a price for this choice, which has been made independently of the type of functions to estimate. But this choice of geometry is, in our case, the same as the one met in density approximation using approximation theory when choosing a particular wavelet basis. Depending on the type of Bayes rules we have to estimate, a special partition can be considered. For example, our "dyadic approach" is very well adapted to the estimation of Bayes rules associated with a chessboard (with the value 1 for black squares and $-1$ for white squares). This kind of Bayes rule is very badly estimated by classification procedures estimating the border, since most of these procedures rely on regularity assumptions which are not fulfilled in the case of a chessboard.
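The chessboard remark can be illustrated with a minimal Python sketch of a cell-wise majority vote on the dyadic partition of $[0,1]^2$. It is only a sketch under our own assumptions: ties and empty cells are set to $-1$, the "black" squares are those with even index sum, and the names `dyadic_majority_vote` and `chessboard_label` are hypothetical rather than taken from the paper.

```python
def cell_index(x, y, J):
    """Index (k1, k2) of the dyadic cell of frequency J containing (x, y)."""
    size = 2 ** J
    return (min(int(x * size), size - 1), min(int(y * size), size - 1))


def dyadic_majority_vote(points, labels, J):
    """Map every cell of the 2^J x 2^J partition to +1 or -1 by majority vote.

    Assumption: ties and empty cells receive the value -1.
    """
    votes = {}
    for (x, y), label in zip(points, labels):
        k = cell_index(x, y, J)
        votes[k] = votes.get(k, 0) + label
    size = 2 ** J
    return {
        (k1, k2): (1 if votes.get((k1, k2), 0) > 0 else -1)
        for k1 in range(size) for k2 in range(size)
    }


def chessboard_label(x, y, J=3):
    """Noise-free chessboard rule: +1 on cells with even k1 + k2, -1 otherwise."""
    k1, k2 = cell_index(x, y, J)
    return 1 if (k1 + k2) % 2 == 0 else -1


# Deterministic design: a regular 32 x 32 grid, i.e. 16 points per cell for J = 3.
points = [((i + 0.5) / 32, (j + 0.5) / 32) for i in range(32) for j in range(32)]
labels = [chessboard_label(x, y) for x, y in points]
coeffs = dyadic_majority_vote(points, labels, J=3)
```

Because the chessboard is constant on each dyadic cell of frequency 3, the vote recovers it exactly on this design, even though its border is not a regular curve.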

We can extend our approach in several different ways. Consider the dyadic partition of $[0,1]^2$ with frequency $J_n$. Instead of choosing 1 or $-1$ for each square of this partition (as in our approach), we can do a least squares regression in each cell of the partition. Inside a square $Sq = \mathcal{I}_k^{(J_n)}$, where $k \in I_2(J_n)$, we can compute the line minimizing

$$\sum_{i=1}^{n} (f(X_i) - Y_i)^2 \mathbb{1}_{(X_i \in Sq)},$$

where $f$ is taken in the set of all indicators of half spaces of $[0,1]^2$ intersecting $Sq$. Of course, depending on the number of observations inside the cell $Sq$, we can consider bigger classes of functions than the one made of the indicators of half spaces. Our classifier is close to the histogram estimator in the density or regression framework, which has been extended to smoother procedures. The other way to extend our approach deals with the problem of the underlying choice of geometry made by taking $S$ as the fundamental system. One possible solution is to consider classifiers "adaptive to the geometry". Using an adaptive procedure, for instance an aggregation procedure (cf. Lecue [2005]), we can construct classifiers adaptive to "rotation" and "translation". Consider the dyadic partition of $[0,1]^2$ at the frequency level $J_n$. We can construct classifiers using the same procedure as (4) but for partitions obtained by translation of the dyadic partition by $(n_1/(2^{J_n}\log n), n_2/(2^{J_n}\log n))$, where $n_1, n_2 = 0, \ldots, [\log n]$. We can do the same thing by aggregating classifiers obtained by the procedure (4) for partitions obtained by rotation of center $(1/2, 1/2)$ with angle $n_3\pi/(2\log n)$, where $n_3 = 0, \ldots, [\log n]$, of the initial dyadic partition. In this heuristic we do not discuss how to solve problems near the border of $[0,1]^2$.

5 Proofs 

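Proof of Theorem 1: Since $\{\eta > 1/2\}$ is almost everywhere open, there exists an open subset $O$ of $[0,1]^d$ such that $\lambda_d(\{\eta > 1/2\} \Delta O) = 0$. If $O$ is the empty set then take $g = -1$; otherwise, for all $x \in O$ denote by $\mathcal{I}_x$ the biggest subset $\mathcal{I}_k^{(j)}$, for $j \in \mathbb{N}$ and $k \in I_d(j)$, such that $x \in \mathcal{I}_k^{(j)}$ and $\mathcal{I}_k^{(j)} \subseteq O$. Remark that $\mathcal{I}_x$ exists because $O$ is open. We can see that for any $y \in \mathcal{I}_x$ we have $\mathcal{I}_y = \mathcal{I}_x$; thus, $(\mathcal{I}_x : x \in O)$ is a partition of $O$. We denote by $I_O$ a subset of indexes $(j, k)$, where $j \in \mathbb{N}$, $k \in I_d(j)$, such that $\{\mathcal{I}_x : x \in O\} = \{\mathcal{I}_k^{(j)} : (j, k) \in I_O\}$. For any $(j, k) \in I_O$ we take $a_k^{(j)} = 1$.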

Take $O_1$ an open subset $\lambda_d$-almost everywhere equal to $\{\eta < 1/2\}$. If $O_1$ is the empty set then take $g = 1$. Otherwise, consider the set of indexes $I_{O_1}$ built in the same way as previously, and for any $(j, k) \in I_{O_1}$ we take $a_k^{(j)} = -1$.

For all $(j, k) \notin I_O \cup I_{O_1}$, we take $a_k^{(j)} = 0$. Consider

$$g = \sum_{j \in \mathbb{N}}\sum_{k \in I_d(j)} a_k^{(j)}\phi_k^{(j)}.$$

It is easy to check that the function $g$ belongs to $\mathcal{F}^{(d)}$ and satisfies the writing convention (W) and that, for $\lambda_d$-almost every $x \in [0,1]^d$, $g(x) = f_\eta(x)$.

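Proof of Proposition 1: Assume that $\mathcal{F}_w^{(d)} \ne \{\mathbb{1}_{[0,1]^d}\}$. Take $f \in \mathcal{F}_w^{(d)} - \{\mathbb{1}_{[0,1]^d}\}$. Consider the writing of $f$ in the system $S$ using the convention (W),

$$f = \sum_{j \in \mathbb{N}}\sum_{k \in I_d(j)} a_k^{(j)}\phi_k^{(j)},$$

where $a_k^{(j)} \in \{-1, 0, 1\}$ for any $j \in \mathbb{N}$, $k \in I_d(j)$. Consider $b_k^{(j)} = |a_k^{(j)}|$ for any $j \in \mathbb{N}$, $k \in I_d(j)$. Take $f_2 = \sum_{j \in \mathbb{N}}\sum_{k \in I_d(j)} b_k^{(j)}\phi_k^{(j)}$. Remark that the function $f_2 \in \mathcal{F}^{(d)}$ does not satisfy the writing convention (W). We have $f_2 = \mathbb{1}_{[0,1]^d}$. For any $j \in \mathbb{N}$ we have

$$\mathrm{card}\{k \in I_d(j) : b_k^{(j)} \ne 0\} = \mathrm{card}\{k \in I_d(j) : a_k^{(j)} \ne 0\}. \qquad (5)$$

Moreover, one coefficient $b_k^{(j)} \ne 0$ contributes to fill a cell of Lebesgue measure $2^{-dj}$ in the hypercube $[0,1]^d$. Since the total mass of $[0,1]^d$ is 1, we have

$$1 = \sum_{j \in \mathbb{N}} 2^{-dj}\,\mathrm{card}\{k \in I_d(j) : b_k^{(j)} \ne 0\}. \qquad (6)$$

Moreover, $f \in \mathcal{F}_w^{(d)}$; thus, for any $j \in \mathbb{N}$,

$$\lfloor w(j) \rfloor \ge \mathrm{card}\{k \in I_d(j) : a_k^{(j)} \ne 0\}.$$

We obtain the second assertion of Proposition 1 by using the last inequality together with assertions (5) and (6).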

Assume that $\sum_{j=1}^{+\infty} 2^{-dj}\lfloor w(j) \rfloor \ge 1$. For any integer $j \ne 0$, denote by $\mathcal{I}(j)$ the set of indexes $\{(j, k) : k \in I_d(j)\}$.

We use the natural order of $\mathbb{N}^{d+1}$ to order sets of indexes. Take $\mathcal{I}_w(1)$ the family of the first $\lfloor w(1) \rfloor$ elements of $\mathcal{I}(1)$. Denote by $\mathcal{I}_w(2)$ the family made of the first $\lfloor w(1) \rfloor$ elements of $\mathcal{I}(1)$ to which we add, at the end of this family and in the correct order, the first $\lfloor w(2) \rfloor$ elements $(2, k)$ of $\mathcal{I}(2)$ such that $\phi_{k'}^{(1)}\phi_k^{(2)} = 0$ for any $(1, k') \in \mathcal{I}_w(1)$, and so on. At the step $j$, construct the family $\mathcal{I}_w(j)$ made of all the elements of $\mathcal{I}_w(j-1)$ in the same order and add at the end of this family the indexes $(j, k)$ in $\mathcal{I}(j)$, among the first $\lfloor w(j) \rfloor$ elements of $\mathcal{I}(j)$, such that $\phi_{k'}^{(J)}\phi_k^{(j)} = 0$ for any $(J, k') \in \mathcal{I}_w(j-1)$. If there is no more index satisfying this condition then we stop the construction; otherwise we go on. Denote by $\mathcal{I}$ the final family obtained by this construction ($\mathcal{I}$ may be finite or infinite). Then, we enumerate the indexes of $\mathcal{I}$ by $(j_1, k_1) \prec (j_2, k_2) \prec \cdots$. For the first element $(j_1, k_1) \in \mathcal{I}$ take $a_{k_1}^{(j_1)} = 1$, for the second element $(j_2, k_2) \in \mathcal{I}$ take $a_{k_2}^{(j_2)} = -1$, etc. Consider the function

$$f = \sum_{j \in \mathbb{N}}\sum_{k \in I_d(j)} a_k^{(j)}\phi_k^{(j)}.$$

If the construction stops at a given iteration $N$ then $f$ takes its values in $\{-1, 1\}$ and the writing convention (W) is fulfilled, since every cell $\mathcal{I}_k^{(j)}$ such that $a_k^{(j)} \ne 0$ has a neighboring cell associated with a nonzero coefficient of opposite value. Otherwise, for any integer $j \ne 0$, the number of coefficients $a_k^{(j)}$, for $k \in I_d(j)$, not equal to 0 is $\lfloor w(j) \rfloor$, and the total mass of the cells $\mathcal{I}_k^{(j)}$ such that $a_k^{(j)} \ne 0$ is $\sum_{j \in \mathbb{N}} 2^{-dj}\,\mathrm{card}\{k \in I_d(j) : a_k^{(j)} \ne 0\}$, which is greater than or equal to 1 by assumption. Thus, the whole hypercube is filled by cells associated with coefficients not equal to 0. So $f$ takes its values in $\{-1, 1\}$ and the writing convention (W) is fulfilled, since every cell $\mathcal{I}_k^{(j)}$ such that $a_k^{(j)} \ne 0$ has a neighboring cell associated with a nonzero coefficient of opposite value. Moreover $f \ne \mathbb{1}_{[0,1]^d}$.

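Proof of Theorem 2. Let $\pi = (P^X, \eta)$ be a probability measure on $\mathcal{X} \times \{-1, 1\}$ belonging to $\mathcal{P}_{w,A}$. Denote by $f^*$ a Bayes classifier associated to $\pi$ (for example $f^* = \mathrm{sign}(2\eta - 1)$). We have

$$d_\pi(f, f^*) = (1/2)\,\mathbb{E}\left[|2\eta(X) - 1|\,|f(X) - f^*(X)|\right] \le (A/2)\,\|f - f^*\|_{L^1(\lambda_d)}.$$

Let $\epsilon > 0$. Define by $J_\epsilon$ the smallest integer satisfying

$$\sum_{j=J_\epsilon+1}^{+\infty} 2^{-dj}\lfloor w(j) \rfloor < \epsilon/A.$$

We write $f^*$ in the fundamental system $(\phi_k^{(j)}, j \ge J_\epsilon, k \in I_d(j))$ using the convention of writing of section 3.1, but we start at the level of frequency $J_\epsilon$:

$$f^* = \sum_{k \in I_d(J_\epsilon)} A_k^{(J_\epsilon)}\phi_k^{(J_\epsilon)} + \sum_{j=J_\epsilon+1}^{+\infty}\sum_{k \in I_d(j)} a_k^{(j)}\phi_k^{(j)}.$$

We consider

$$f_\epsilon = \sum_{k \in I_d(J_\epsilon)} B_k^{(J_\epsilon)}\phi_k^{(J_\epsilon)}, \qquad (7)$$

where

$$B_k^{(J_\epsilon)} = \begin{cases} 1 & \text{if } p_k^{(J_\epsilon)} > 1/2, \\ -1 & \text{otherwise} \end{cases} \qquad (8)$$

and

$$p_k^{(J_\epsilon)} = \mathbb{P}(Y = 1 \mid X \in \mathcal{I}_k^{(J_\epsilon)}) = \int_{\mathcal{I}_k^{(J_\epsilon)}} \eta(x)\,\frac{dP^X(x)}{P^X(\mathcal{I}_k^{(J_\epsilon)})} \qquad (9)$$

for all $k \in I_d(J_\epsilon)$. Note that, if $A_k^{(J_\epsilon)} \ne 0$ then $A_k^{(J_\epsilon)} = B_k^{(J_\epsilon)}$; moreover $f^*$ takes its values in $\{-1, 1\}$; thus, we have

$$\begin{aligned}
\|f_\epsilon - f^*\|_{L^1(\lambda_d)} &= \sum_{k : A_k^{(J_\epsilon)} \ne 0}\int_{\mathcal{I}_k^{(J_\epsilon)}} |f^*(x) - f_\epsilon(x)|\,dx + \sum_{k : A_k^{(J_\epsilon)} = 0}\int_{\mathcal{I}_k^{(J_\epsilon)}} |f^*(x) - f_\epsilon(x)|\,dx \\
&\le 2^{-dJ_\epsilon+1}\,\mathrm{card}\{k \in I_d(J_\epsilon) : A_k^{(J_\epsilon)} = 0\} \le 2\sum_{j=J_\epsilon+1}^{+\infty} 2^{-dj}\lfloor w(j) \rfloor < 2\epsilon/A.
\end{aligned}$$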

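Proof of Theorem 3. Let $\pi = (P^X, \eta)$ be a probability measure on $\mathcal{X} \times \{-1, 1\}$ satisfying (A1), (SMA) and such that $f^* = \mathrm{sign}(2\eta - 1)$, a Bayes classifier associated to $\pi$, belongs to $\mathcal{F}_w^{(d)}$ (an $L^1$-ball of Bayes rules).

Let $\epsilon > 0$ and $J_\epsilon$ the smallest integer satisfying $\sum_{j=J_\epsilon+1}^{+\infty} 2^{-dj}\lfloor w(j) \rfloor < \epsilon/A$. We decompose the risk into a bias term and a variance term:

$$\mathcal{E}(\hat f_n^{(J_\epsilon)}) = \mathbb{E}\left[d_\pi(\hat f_n^{(J_\epsilon)}, f^*)\right] \le \mathbb{E}\left[d_\pi(\hat f_n^{(J_\epsilon)}, f_\epsilon)\right] + d_\pi(f_\epsilon, f^*),$$

where $\hat f_n^{(J_\epsilon)}$ is introduced in (4) and $f_\epsilon$ in (7).

Using the definition of $J_\epsilon$ and according to the approximation Theorem (Theorem 1), the bias term satisfies:

$$d_\pi(f_\epsilon, f^*) \le \epsilon.$$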

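For the variance term we have (using the notations introduced in (4) and (8)):

$$\begin{aligned}
\mathbb{E}\left[d_\pi(\hat f_n^{(J_\epsilon)}, f_\epsilon)\right] &\le \frac{1}{2}\,\mathbb{E}\left[\int_{[0,1]^d} |f_\epsilon(x) - \hat f_n^{(J_\epsilon)}(x)|\,dP^X(x)\right] \\
&\le \frac{A}{2^{dJ_\epsilon+1}}\sum_{k \in I_d(J_\epsilon)} \mathbb{E}\left[|B_k^{(J_\epsilon)} - \hat A_k^{(J_\epsilon)}|\right] \le \frac{A}{2^{dJ_\epsilon}}\sum_{k \in I_d(J_\epsilon)} \mathbb{P}\left(|B_k^{(J_\epsilon)} - \hat A_k^{(J_\epsilon)}| = 2\right).
\end{aligned}$$

Let $k \in I_d(J_\epsilon)$. For any $m \in \{0, \ldots, n\}$, we introduce the sets

$$\Omega_k^{(m)} = \left\{\mathrm{Card}\{i \in \{1, \ldots, n\} : X_i \in \mathcal{I}_k^{(J_\epsilon)}\} = m\right\}$$

and

$$\Omega_k = \left\{\mathrm{card}\{i \in \{1, \ldots, n\} : X_i \in \mathcal{I}_k^{(J_\epsilon)}, Y_i = 1\} \le \mathrm{card}\{i \in \{1, \ldots, n\} : X_i \in \mathcal{I}_k^{(J_\epsilon)}, Y_i = -1\}\right\}.$$

We have

$$\mathbb{P}(\hat A_k^{(J_\epsilon)} = -1) = \mathbb{P}(\Omega_k^{(0)c} \cap \Omega_k) + \mathbb{P}(\Omega_k^{(0)})$$

and

$$\mathbb{P}(\Omega_k^{(0)c} \cap \Omega_k) = \sum_{m=1}^{n} \mathbb{P}(\Omega_k^{(m)} \cap \Omega_k) = \sum_{m=1}^{n} \mathbb{P}(\Omega_k \mid \Omega_k^{(m)})\,\mathbb{P}(\Omega_k^{(m)}).$$

Moreover, denote by $Z_1, \ldots, Z_n$ some i.i.d. variables with a Bernoulli distribution of parameter $p_k^{(J_\epsilon)}$ as common probability distribution ($p_k^{(J_\epsilon)}$ is introduced in (9) and is equal to $\mathbb{P}(Y = 1 \mid X \in \mathcal{I}_k^{(J_\epsilon)})$); we have, for any $m = 1, \ldots, n$,

$$\mathbb{P}(\Omega_k \mid \Omega_k^{(m)}) = \mathbb{P}\left(\frac{1}{m}\sum_{i=1}^{m} Z_i \le \frac{1}{2}\right).$$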
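The concentration inequality of Hoeffding leads to

$$\mathbb{P}\left(\frac{1}{m}\sum_{i=1}^{m} Z_i \ge p_k^{(J_\epsilon)} + t\right) \le \exp(-2mt^2) \quad \text{and} \quad \mathbb{P}\left(\frac{1}{m}\sum_{i=1}^{m} Z_i \le p_k^{(J_\epsilon)} - t\right) \le \exp(-2mt^2), \qquad (10)$$

for all $t > 0$ and $m = 1, \ldots, n$.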
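As a sanity check of the two Hoeffding inequalities for Bernoulli averages, the following Python sketch computes the exact binomial tails and compares them with $\exp(-2mt^2)$. It is illustrative only; the function names are ours, and the small tolerance in the threshold comparison is there to absorb floating-point rounding.

```python
import math


def binomial_pmf(m, p, s):
    """P(Z_1 + ... + Z_m = s) for i.i.d. Bernoulli(p) variables."""
    return math.comb(m, s) * p ** s * (1 - p) ** (m - s)


def upper_tail(m, p, t):
    """Exact value of P((1/m) * sum Z_i >= p + t)."""
    threshold = m * (p + t)
    return sum(binomial_pmf(m, p, s) for s in range(m + 1) if s >= threshold - 1e-12)


def lower_tail(m, p, t):
    """Exact value of P((1/m) * sum Z_i <= p - t)."""
    threshold = m * (p - t)
    return sum(binomial_pmf(m, p, s) for s in range(m + 1) if s <= threshold + 1e-12)


def hoeffding_bound(m, t):
    """Right-hand side exp(-2 m t^2) common to both inequalities."""
    return math.exp(-2 * m * t * t)
```

In the proof, the inequalities are applied with $t = |p_k^{(J_\epsilon)} - 1/2|$, so that the threshold of the tail event is exactly $1/2$.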

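Denote by $b_k^{(J_\epsilon)}$ the probability $\mathbb{P}(X \in \mathcal{I}_k^{(J_\epsilon)})$. If $p_k^{(J_\epsilon)} > 1/2$, applying the second inequality of (10) leads to

$$\begin{aligned}
\mathbb{P}\left(|B_k^{(J_\epsilon)} - \hat A_k^{(J_\epsilon)}| = 2\right) &= \mathbb{P}\left(\hat A_k^{(J_\epsilon)} = -1\right) \\
&\le \sum_{m=1}^{n} \mathbb{P}\left(\frac{1}{m}\sum_{j=1}^{m} Z_j \le \frac{1}{2}\right)\binom{n}{m}\left(b_k^{(J_\epsilon)}\right)^m\left(1 - b_k^{(J_\epsilon)}\right)^{n-m} + \mathbb{P}(\Omega_k^{(0)}) \\
&\le \sum_{m=0}^{n} \exp\left(-2m(p_k^{(J_\epsilon)} - 1/2)^2\right)\binom{n}{m}\left(b_k^{(J_\epsilon)}\right)^m\left(1 - b_k^{(J_\epsilon)}\right)^{n-m} \\
&= \left(1 - b_k^{(J_\epsilon)}\left(1 - \exp(-2(p_k^{(J_\epsilon)} - 1/2)^2)\right)\right)^n \\
&\le \exp\left(-na\left(1 - \exp(-2(p_k^{(J_\epsilon)} - 1/2)^2)\right)2^{-dJ_\epsilon}\right).
\end{aligned}$$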

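If $p_k^{(J_\epsilon)} < 1/2$ then arguments similar to those used in the previous case, together with the first inequality of (10), lead to

$$\mathbb{P}\left(|B_k^{(J_\epsilon)} - \hat A_k^{(J_\epsilon)}| = 2\right) = \mathbb{P}\left(\hat A_k^{(J_\epsilon)} = 1\right) \le \exp\left(-na\left(1 - \exp(-2(p_k^{(J_\epsilon)} - 1/2)^2)\right)2^{-dJ_\epsilon}\right).$$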



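If $p_k^{(J_\epsilon)} = 1/2$, we use $\mathbb{P}\left(|B_k^{(J_\epsilon)} - \hat A_k^{(J_\epsilon)}| = 2\right) \le 1$. As in the proof of Theorem 2, we use the writing

$$f^* = \sum_{k \in I_d(J_\epsilon)} A_k^{(J_\epsilon)}\phi_k^{(J_\epsilon)} + \sum_{j=J_\epsilon+1}^{+\infty}\sum_{k \in I_d(j)} a_k^{(j)}\phi_k^{(j)}.$$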

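Since $P^X(\eta = 1/2) = 0$, if $A_k^{(J_\epsilon)} \ne 0$ then $p_k^{(J_\epsilon)} \ne 1/2$. Thus, the variance term satisfies:

$$\begin{aligned}
\mathbb{E}\left[d_\pi(\hat f_n^{(J_\epsilon)}, f_\epsilon)\right] &\le \frac{A}{2^{dJ_\epsilon}}\left(\sum_{k : A_k^{(J_\epsilon)} \ne 0} \mathbb{P}\left(|B_k^{(J_\epsilon)} - \hat A_k^{(J_\epsilon)}| = 2\right) + \sum_{k : A_k^{(J_\epsilon)} = 0} \mathbb{P}\left(|B_k^{(J_\epsilon)} - \hat A_k^{(J_\epsilon)}| = 2\right)\right) \\
&\le \frac{A}{2^{dJ_\epsilon}}\sum_{k : A_k^{(J_\epsilon)} \ne 0} \exp\left(-na\left(1 - \exp(-2(p_k^{(J_\epsilon)} - 1/2)^2)\right)2^{-dJ_\epsilon}\right) + A\epsilon.
\end{aligned}$$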



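If $A_k^{(J_\epsilon)} \ne 0$ then $\eta > 1/2$ or $\eta < 1/2$ over the whole set $\mathcal{I}_k^{(J_\epsilon)}$, so

$$\left|\frac{1}{2} - p_k^{(J_\epsilon)}\right| = \int_{\mathcal{I}_k^{(J_\epsilon)}} \left|\eta(x) - \frac{1}{2}\right|\frac{dP^X(x)}{P^X(\mathcal{I}_k^{(J_\epsilon)})}.$$

Moreover $\pi$ satisfies $\mathbb{P}(|2\eta(X) - 1| \ge h) = 1$, so

$$\left|\frac{1}{2} - p_k^{(J_\epsilon)}\right| \ge \frac{h}{2}.$$

We have shown that for all $\epsilon > 0$,

$$\mathcal{E}(\hat f_n^{(J_\epsilon)}) = \mathbb{E}\left[d_\pi(\hat f_n^{(J_\epsilon)}, f^*)\right] \le (1 + A)\epsilon + \exp\left(-na\left(1 - \exp(-2(h/2)^2)\right)2^{-dJ_\epsilon}\right),$$

where $J_\epsilon$ is the smallest integer satisfying $\sum_{j=J_\epsilon+1}^{+\infty} 2^{-dj}\lfloor w(j) \rfloor < \epsilon/A$.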

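Proof of Theorem 4. For all $q \in \mathbb{N}$ we consider $G_q$ a net of $[0,1]^d$ defined by:

$$G_q = \left\{\left(\frac{2k_1 + 1}{2^{q+1}}, \ldots, \frac{2k_d + 1}{2^{q+1}}\right) : (k_1, \ldots, k_d) \in \{0, \ldots, 2^q - 1\}^d\right\}$$

and the function $\eta_q$ from $[0,1]^d$ to $G_q$ such that $\eta_q(x)$ is the closest point of $G_q$ from $x$ (in the case of ties, we choose the smallest point for the usual order on $\mathbb{R}^d$). Associated to this grid, the partition $\mathcal{X}'^{(q)}_1, \ldots, \mathcal{X}'^{(q)}_{2^{dq}}$ of $[0,1]^d$ is defined by: $x, y \in \mathcal{X}'^{(q)}_i$ iff $\eta_q(x) = \eta_q(y)$, and we use a special indexation for this partition: denote by $x'^{(q)}_{k_1,\ldots,k_d} = \left(\frac{2k_1+1}{2^{q+1}}, \ldots, \frac{2k_d+1}{2^{q+1}}\right)$ and we say that $x'^{(q)}_{k_1,\ldots,k_d} \prec x'^{(q)}_{k'_1,\ldots,k'_d}$ if

$$\eta_{q-1}(x'^{(q)}_{k_1,\ldots,k_d}) \prec \eta_{q-1}(x'^{(q)}_{k'_1,\ldots,k'_d})$$

or

$$\eta_{q-1}(x'^{(q)}_{k_1,\ldots,k_d}) = \eta_{q-1}(x'^{(q)}_{k'_1,\ldots,k'_d}) \text{ and } (k_1, \ldots, k_d) < (k'_1, \ldots, k'_d),$$

for the usual order on $\mathbb{N}^d$. Thus, the partition $(\mathcal{X}'^{(q)}_j : j = 1, \ldots, 2^{dq})$ has an increasing indexation according to the order of $(x'^{(q)}_{k_1,\ldots,k_d})$ for the order defined above. This order takes care of the previous partition by splitting blocks in the right given order, and inside a block of a partition we take the natural order of $\mathbb{N}^d$. We introduce another parameter $m \in \{1, \ldots, 2^{dq}\}$ and we define, for all $i = 1, \ldots, m$, $\mathcal{X}^{(q)}_i = \mathcal{X}'^{(q)}_i$ and $\mathcal{X}^{(q)}_0 = [0,1]^d - \cup_{i=1}^{m}\mathcal{X}^{(q)}_i$. Parameters $q$ and $m$ will be chosen later. We consider $W \in [0, m^{-1}]$, chosen later, and define the function $f_X$ from $[0,1]^d$ to $\mathbb{R}$ by $f_X = W/\lambda_d(\mathcal{X}_1)$ (where $\lambda_d$ is the Lebesgue measure on $[0,1]^d$) on $\mathcal{X}_1, \ldots, \mathcal{X}_m$ and $(1 - mW)/\lambda_d(\mathcal{X}_0)$ on $\mathcal{X}_0$. We denote by $P^X$ the probability distribution on $[0,1]^d$ with the density $f_X$ w.r.t. the Lebesgue measure. For all $\sigma = (\sigma_1, \ldots, \sigma_m) \in \Omega = \{-1, 1\}^m$ we consider $\eta_\sigma$ defined for any $x \in [0,1]^d$ by

$$\eta_\sigma(x) = \begin{cases} \dfrac{1 + \sigma_j h}{2} & \text{if } x \in \mathcal{X}_j,\ j = 1, \ldots, m, \\ 1 & \text{if } x \in \mathcal{X}_0. \end{cases}$$

We have a set of probability measures $\{\pi_\sigma : \sigma \in \Omega\}$ on $[0,1]^d \times \{-1, 1\}$ indexed by the hypercube $\Omega$, where $P^X$ is the marginal on $[0,1]^d$ of $\pi_\sigma$ and $\eta_\sigma$ its conditional probability function of $Y = 1$ given $X$. We denote by $f^*_\sigma$ the Bayes rule associated to $\pi_\sigma$; we have $f^*_\sigma(x) = \sigma_j$ if $x \in \mathcal{X}_j$ for $j = 1, \ldots, m$ and 1 if $x \in \mathcal{X}_0$, for any $\sigma \in \Omega$.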
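Now we give conditions on $q$, $m$ and $W$ such that for all $\sigma$ in $\Omega$, $\pi_\sigma$ belongs to $\mathcal{P}_{w,h,a,A}$. If we take

$$W = 2^{-dq}, \qquad (11)$$

then $P^X = \lambda_d$ and $\forall x \in [0,1]^d$, $a \le dP^X/d\lambda_d(x) \le A$. We have clearly $|2\eta_\sigma(x) - 1| \ge h$ for any $x \in [0,1]^d$. We can see that $f^*_\sigma \in \mathcal{F}_w^{(d)}$ for all $\sigma \in \{-1, 1\}^m$ iff

$$\begin{aligned}
\lfloor w(q+1) \rfloor &\ge \inf\left(x \in 2^d\mathbb{N} : x \ge m\right), \\
\lfloor w(q) \rfloor &\ge \begin{cases} 2^d - 1 & \text{if } m < 2^d, \\ \inf\left(x \in 2^d\mathbb{N} : x \ge 2^{-d}m\right) & \text{otherwise,} \end{cases} \\
&\ \ \vdots \\
\lfloor w(1) \rfloor &\ge \begin{cases} 2^d - 1 & \text{if } m < 2^{dq}, \\ \inf\left(x \in 2^d\mathbb{N} : x \ge 2^{-dq}m\right) & \text{otherwise,} \end{cases} \\
\lfloor w(0) \rfloor &\ge 1.
\end{aligned}$$

Since we have $\lfloor w(j) \rfloor \ge 2^d - 1$ for all $j \ge 1$ and $\lfloor w(0) \rfloor = 1$, and $\lfloor w(j-1) \rfloor \ge \lfloor w(j) \rfloor/2^d$, then $f^*_\sigma \in \mathcal{F}_w^{(d)}$ for all $\sigma \in \Omega$ iff

$$\lfloor w(q+1) \rfloor \ge \inf\left(x \in 2^d\mathbb{N} : x \ge m\right). \qquad (12)$$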

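Take $q$, $m$ and $W$ such that (11) and (12) are fulfilled; then $\{\pi_\sigma : \sigma \in \Omega\}$ is a subset of $\mathcal{P}_{w,h,a,A}$. Let $\sigma \in \Omega$ and $\hat f_n$ be a classifier; we have

$$\begin{aligned}
\mathbb{E}_{\pi_\sigma}\left[R(\hat f_n) - R^*\right] &= (1/2)\,\mathbb{E}_{\pi_\sigma}\left[|2\eta_\sigma(X) - 1|\,|\hat f_n(X) - f^*_\sigma(X)|\right] \\
&\ge (h/2)\,\mathbb{E}_{\pi_\sigma}\left[|\hat f_n(X) - f^*_\sigma(X)|\right] \\
&\ge (h/2)\,\mathbb{E}_{\pi_\sigma}\left[\sum_{i=1}^{m}\int_{\mathcal{X}_i} |\hat f_n(x) - f^*_\sigma(x)|\,dP^X(x) + \int_{\mathcal{X}_0} |\hat f_n(x) - f^*_\sigma(x)|\,dP^X(x)\right] \\
&\ge (Wh/2)\sum_{i=1}^{m}\mathbb{E}_{\pi_\sigma}\left[\int_{\mathcal{X}_i} |\hat f_n(x) - \sigma_i|\,\frac{dx}{\lambda_d(\mathcal{X}_1)}\right] \\
&\ge (Wh/2)\,\mathbb{E}_{\pi_\sigma}\left[\sum_{i=1}^{m}\left|\sigma_i - \int_{\mathcal{X}_i} \hat f_n(x)\,\frac{dx}{\lambda_d(\mathcal{X}_1)}\right|\right].
\end{aligned}$$

We deduce that

$$\inf_{\hat f_n}\sup_{\pi \in \mathcal{P}_{w,h,a,A}} \mathcal{E}_\pi(\hat f_n) \ge (Wh/2)\inf_{\hat\sigma_n \in [-1,1]^m}\sup_{\sigma \in \{-1,1\}^m} \mathbb{E}_{\pi_\sigma}\left[\sum_{i=1}^{m} |\sigma_i - \hat\sigma_i|\right].$$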
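Now, we control the Hellinger distance between two neighbouring probability measures. Let $\rho$ be the Hamming distance on $\Omega$. Let $\sigma, \sigma'$ in $\Omega$ such that $\rho(\sigma, \sigma') = 1$. We have

$$H^2(\pi_\sigma^{\otimes n}, \pi_{\sigma'}^{\otimes n}) = 2\left(1 - \left(1 - \frac{H^2(\pi_\sigma, \pi_{\sigma'})}{2}\right)^n\right)$$

and a straightforward calculus leads to $H^2(\pi_\sigma, \pi_{\sigma'}) = 2W\left(1 - \sqrt{1 - h^2}\right)$. Take

$$W = 1/n, \qquad (13)$$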

thus, for any integer n, we have i7^(7rf tt®") < /? < 2 where /3 = 2 (l — cxp(l — Vl — 
The Assouad's Lemma (cf. Lecue [2006c]) yields inf^^g[_i^i]m sup^g^.^ j^j™ E^^ [llT=i \^ 
f (1 - 1)1 We conclude that 

inf sup 8^{fn)>Wh^(l-^\ . 

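According to (11), (12) and (13) we take $W = 2^{-dq} = 1/n$, $q = \lfloor \log n/(d\log 2) \rfloor$, $m = \lfloor w(\lfloor \log n/(d\log 2) \rfloor + 1) \rfloor - (2^d - 1)$. For these values we have

$$\inf_{\hat f_n}\sup_{\pi \in \mathcal{P}_{w,h,a,A}} \mathcal{E}_\pi(\hat f_n) \ge C_0\,n^{-1}\left(\lfloor w(\lfloor \log n/(d\log 2) \rfloor + 1) \rfloor - (2^d - 1)\right),$$

where $C_0 = (h/8)\exp\left(-(1 - \sqrt{1 - h^2})\right)$.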

Proof of Corollary 1: It suffices to apply Theorem 4 to the function $w$ defined by $w(j) = 2^{dj}$ for any integer $j$, and $a = A = 1$ for $P^X = \lambda_d$.

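Proof of Theorem 5:

1. If we assume that $J_\epsilon \ge K$ then $\sum_{j=J_\epsilon+1}^{+\infty} 2^{-dj}\lfloor w_K^{(d)}(j) \rfloor = (2^{dK})/(2^{dJ_\epsilon}(2^d - 1))$. We take

$$J_\epsilon = \left\lceil \frac{\log\left((A2^{dK})/(\epsilon(2^d - 1))\right)}{d\log 2} \right\rceil$$

and $\epsilon_n$ the unique solution of $(1 + A)\epsilon_n = \exp(-nC\epsilon_n)$, where $C = a(1 - e^{-h^2/2})(2^d - 1)[A2^{d(K+1)}]^{-1}$. Thus, $\epsilon_n \le (\log n)/(Cn)$. For $J_n = J_{\epsilon_n}$, we have

$$\mathcal{E}(\hat f_n^{(J_n)}) \le C_{K,d,h,a,A}\,\frac{\log n}{n}$$

for any integer $n$ such that $\log n \ge 2^{d(K+1)}(2^d - 1)^{-1}$ and $J_n \ge K$, where $C_{K,d,h,a,A} = 2(1 + A)/C$.

If we have $\lfloor \log n/(d\log 2) \rfloor \ge 2$ then $\lfloor w(\lfloor \log n/(d\log 2) \rfloor + 1) \rfloor - (2^d - 1) \ge 2^d$, so we obtain the lower bound with the constant $C_{0,K} = 2^d C_0$, and if $\lfloor \log n/(d\log 2) \rfloor \ge K$ the constant can be $C_{0,K} = C_0(2^{dK} - (2^d - 1))$.

2. If we have $J_\epsilon \ge N^{(d)}(\alpha)$, then $\sum_{j=J_\epsilon+1}^{+\infty} 2^{-dj}\lfloor w_\alpha^{(d)}(j) \rfloor \le \left(2^{d(1-\alpha)J_\epsilon}(2^{d(1-\alpha)} - 1)\right)^{-1}$. We take

$$J_\epsilon = \left\lceil \frac{\log\left(A/(\epsilon(2^{d(1-\alpha)} - 1))\right)}{d(1 - \alpha)\log 2} \right\rceil.$$

Denote by $\epsilon_n$ the unique solution of $(1 + A)\epsilon_n = \exp\left(-nC\epsilon_n^{1/(1-\alpha)}\right)$, where $C = a(1 - e^{-h^2/2})2^{-d}\left(A^{-1}(2^{d(1-\alpha)} - 1)\right)^{1/(1-\alpha)}$. We have $\epsilon_n \le (\log n/(nC))^{1-\alpha}$. For $J_n = J_{\epsilon_n}$, we have

$$\mathcal{E}(\hat f_n^{(J_n)}) \le \frac{2(1 + A)A}{2^{d(1-\alpha)} - 1}\left(\frac{2^d}{a(1 - e^{-h^2/2})}\right)^{1-\alpha}\left(\frac{\log n}{n}\right)^{1-\alpha}.$$

For the lower bound we have, for any integer $n$,

$$\inf_{\hat f_n}\sup_{\pi} \mathcal{E}_\pi(\hat f_n) \ge C_0\max\left(1, n^{-1}\left(2^d n^\alpha - (2^d - 1)\right)\right).$$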
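The fixed-point equation $(1 + A)\epsilon_n = \exp(-nC\epsilon_n)$ of item 1 has no closed form, but it is easy to solve numerically. The Python sketch below uses bisection (our choice of method, with a hypothetical function name `solve_eps`) and can be used to check the bound $\epsilon_n \le (\log n)/(Cn)$ for given values of $n$, $A$ and $C$; the numerical values in the comment are arbitrary and not taken from the paper.

```python
import math


def solve_eps(n, A, C, iterations=200):
    """Root of g(eps) = (1 + A) * eps - exp(-n * C * eps) on [0, 1], by bisection.

    g is increasing with g(0) = -1 < 0 and g(1) > 0, so the root is unique.
    """
    def g(eps):
        return (1 + A) * eps - math.exp(-n * C * eps)

    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi


# Example with arbitrary constants: compare the root with log(n) / (C * n).
eps_n = solve_eps(n=10_000, A=2.0, C=0.5)
```

The comparison with $(\log n)/(Cn)$ works because the left-hand side of the equation, evaluated at that point, already exceeds the right-hand side $1/n$ as soon as $(1 + A)\log n \ge C$.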

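Proof of Theorem 6: Let $\epsilon > 0$. Denote by $\epsilon_0$ the greatest positive number satisfying $\delta(\epsilon_0)\epsilon_0^2 \le \epsilon$. Consider $N(\epsilon_0) = \mathcal{N}(\partial A, \epsilon_0, \|\cdot\|_\infty)$ and $x_1, \ldots, x_{N(\epsilon_0)} \in \mathbb{R}^2$ such that $\partial A \subseteq \cup_{j=1}^{N(\epsilon_0)} B_\infty(x_j, \epsilon_0)$. Since $2^{-J_{\epsilon_0}} \ge \epsilon_0$, only nine dyadic sets of frequency $J_{\epsilon_0}$ are needed to cover a ball of radius $\epsilon_0$ for the infinity norm of $\mathbb{R}^2$. Thus, we only need $9N(\epsilon_0)$ dyadic sets of frequency $J_{\epsilon_0}$ to cover $\partial A$. Consider the partition of $[0,1]^2$ by dyadic sets of frequency $J_{\epsilon_0}$. Except on the $9N(\epsilon_0)$ dyadic sets used to cover the border $\partial A$, the prediction rule $f_A$ is constant, equal to 1 or $-1$, on the other dyadic sets. Thus, by taking $f_{\epsilon_0} = \sum_{k_1,k_2=0}^{2^{J_{\epsilon_0}}-1} a_{k_1,k_2}^{(J_{\epsilon_0})}\phi_{k_1,k_2}^{(J_{\epsilon_0})}$, where $a_{k_1,k_2}^{(J_{\epsilon_0})}$ is equal to one value of $f_A$ in the dyadic set $\mathcal{I}_{k_1,k_2}^{(J_{\epsilon_0})}$, we have

$$\|f_{\epsilon_0} - f_A\|_{L^1(\lambda_2)} \le 9N(\epsilon_0)2^{-2J_{\epsilon_0}} \le 36\,\delta(\epsilon_0)\epsilon_0^2 \le 36\epsilon.$$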
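The "nine dyadic sets" step can be checked mechanically. The Python sketch below lists the dyadic cells of side $2^{-J}$ met by a sup-norm ball of radius $\epsilon_0 \le 2^{-J}$, using exact rational arithmetic; the helper name `cells_meeting_ball`, the half-open cell convention $[k2^{-J}, (k+1)2^{-J})$ and the absence of clipping to $[0,1]^2$ are our own simplifying assumptions.

```python
from fractions import Fraction
from math import floor


def cells_meeting_ball(x, y, eps, J):
    """Indices (k1, k2) of the dyadic cells of side 2**-J met by the sup-norm ball.

    Cells are taken half-open, [k*h, (k+1)*h) in each coordinate, and the ball
    is the closed square [x - eps, x + eps] x [y - eps, y + eps].
    """
    h = Fraction(1, 2 ** J)

    def index_range(c):
        first = floor((c - eps) / h)
        last = floor((c + eps) / h)
        return range(first, last + 1)

    return [(k1, k2) for k1 in index_range(x) for k2 in index_range(y)]


# A ball of radius eps <= h spans at most three cells per coordinate, hence at most nine cells.
J = 4
eps0 = Fraction(1, 2 ** J)
counts = [
    len(cells_meeting_ball(Fraction(i, 64), Fraction(j, 64), eps0, J))
    for i in range(65) for j in range(65)
]
```

The worst case is reached when the centre of the ball sits on a vertex of the dyadic grid and $\epsilon_0 = 2^{-J}$ exactly.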